1 Morning

REFRESHER: Using R Studio

TIPS

  • Use Tab to auto complete
  • Use up arrow to get previous command

1.0.1 Create a new project in R Studio

  1. From File menu select New Project…
  2. From New Project Dialog select New Directory
  3. Select New Project as project type.
  4. Give a project name, i.e. “Workshop” and click Create Project button.

1.0.2 Create a new script.

  1. From the File menu select New File > R Script
  2. Save file as “my_Rscript.R”

Write commands in code editor of R Studio and run them using icon Run in R Studio.

PREPARATION: Installing R packages

We need several R packages for our analysis today. Some are installed through bioconductor, some from CRAN through install.packages.

  • While installing packages, you may get messages about updating other packages. This is generally optional: you can select not to (enter ‘n’ in the prompt), but at some point it’s a good idea to update.

Bioconductor

If you have not already done so, install BiocManager for Bioconductor tools:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Install Bioconductor packages:

BiocManager::install('multtest', update=F)
BiocManager::install("ComplexHeatmap", update=F)
BiocManager::install("circlize", update=F)
BiocManager::install("edgeR", update=F)

CRAN packages

Install Seurat and other CRAN packages:

install.packages('Seurat')
install.packages('Matrix')
install.packages('tidyr')
install.packages('dplyr')
install.packages('ggplot2')

Presto package for fast wilcox test (recommended)

Install devtools first if you have not done before:

install.packages('devtools')

Install presto from GitHub:

devtools::install_github('immunogenomics/presto')

Check installations

  • Try to load each package to confirm that the installation was successful:
library(Seurat)
library(multtest)
library(circlize)
library(ComplexHeatmap)
library(dplyr)
library(tidyr)
library(edgeR)
library(ggplot2)
library(Matrix)
library(presto)

1.1 View CellRanger report

  • Look at example quantification and demultiplexing report from CellRanger (for 10X)
  • Note these statistics:
    • Number of cells, and associated barcode coverage plot
    • Median UMI counts per cell
    • Median genes per cell
    • Mapping percent to genome
    • Mapping percent to transcriptome
    • Barcode identification percent
    • Sequencing saturation

Open this link in a web browser: https://wd.cri.uic.edu/scrna/10X_example_report.html

1.2 Read single-cell data into R

Practice with inDrop data (full matrix) and 10X data (sparse matrix)

1.2.1 inDrop data

  1. Load the libraries first
library(Seurat)
library(Matrix)
library(dplyr)
  1. Read in inDrop data first with regular read.delim function. We can read the inDrop matrix directly from the URL.
counts_indrop <- read.delim("https://wd.cri.uic.edu/scrna/inDrop_counts.txt",
   row.names=1)
  1. Convert to sparse matrix with the Matrix package
sparse_indrop <- Matrix(as.matrix(counts_indrop),sparse=T)
  1. (OPTIONAL) You can compare the sizes of the original and sparse matrix
format(object.size(counts_indrop), units="auto")
[1] "46.8 Mb"
format(object.size(sparse_indrop), units="auto")
[1] "13.4 Mb"
  1. Create the Seurat object from the sparse matrix.
seurat_indrop <- CreateSeuratObject(counts=sparse_indrop, project="indrop_sample_data")
seurat_indrop
An object of class Seurat 
9435 features across 1275 samples within 1 assay 
Active assay: RNA (9435 features, 0 variable features)
 1 layer present: counts

1.2.2 10X data

  • First download this file from a web browser: https://wd.cri.uic.edu/scrna/10X_test.zip
  • Then unzip it on your computer.
  • NOTE: we’re assuming you have downloaded it to your ~/Downloads folder. Please change the path in the first command below as necessary based on where you downloaded the data.
  • When we use the Read10X function, we give it a folder name, and it expects to find 3 files in that folder:
    1. matrix.mtx.gz (gene expression matrix in sparse format)
    2. features.tsv.gz (list of gene names, usually with 2 columns, Ensembl ID and gene symbol)
    3. barcodes.tsv.gz (list of cell barcodes)
  • If you want to do the analysis with respect to gene symbols, you’d use ‘gene.column=2’ below. But some gene symbols will be duplicated. Best to use IDs as the primary identifier, and we’ll show how to bring gene symbols back in at Exercise 2.1.
  1. Read the 10X data.
    • If you are using a macOS computer and unzipped the file in your Downloads folder, execute the following.
    sparse_10X <- Read10X("~/Downloads/10X_test/filtered_feature_bc_matrix/", 
                          gene.column=1)
    • If you are using a Windows computer and unzipped the file in your Downloads folder, execute the following.
    sparse_10X <- Read10X("~/../Downloads/10X_test/filtered_feature_bc_matrix/", 
                          gene.column=1)
  2. Create the Seurat object
seurat_10X <- CreateSeuratObject(counts=sparse_10X, project="10X_sample_data")
  1. View the created Seurat object
seurat_10X
An object of class Seurat 
31053 features across 1301 samples within 1 assay 
Active assay: RNA (31053 features, 0 variable features)
 1 layer present: counts

1.3 Quality control and filtering

1.3.1 Download the data and read into R (TAKE HOME: Skip for now, try later at home.)

NOTE: skip this part, try at home later. Skip to Cell QC steps.

Now we will analyze a larger 10X data set. These are data from a published study, Martos SN, Campbell MR, Lozoya OA, Wang X et al. Single-cell analyses identify dysfunctional CD16+ CD8 T cells in smokers. Cell Rep Med 2020 Jul 21;1(4). PMID: 33163982.

  • Peripheral blood cells from smoking and nonsmoking adults were extracted, captures run on 10X.
  • 8 total samples, 4 each from smoker and non-smoker.
  • GEO accession GSE138867.
  • The cells in these samples have been sub-sampled to 1000 cells each to save on computational time during the workshop.

Please download zip files with the data first:

Unzip these on your computer. Note the folder these are downloaded to. The exercises below assume they are in your ~/Downloads folder.

Ensure the Seurat package is loaded.

library(Seurat)

Read the 10X files

  1. If you are using a macOS computer and unzipped the file in your Downloads folder, execute the following.
V035_F031 <- Read10X("~/Downloads/GSE138867/sub_V035_F031", gene.column=1)
V039_F093 <- Read10X("~/Downloads/GSE138867/sub_V039_F093", gene.column=1)
V040_F198 <- Read10X("~/Downloads/GSE138867/sub_V040_F198", gene.column=1)
V041_F043 <- Read10X("~/Downloads/GSE138867/sub_V041_F043", gene.column=1)
V043_F047 <- Read10X("~/Downloads/GSE138867/sub_V043_F047", gene.column=1)
V044_F037 <- Read10X("~/Downloads/GSE138867/sub_V044_F037", gene.column=1)
V045_F120 <- Read10X("~/Downloads/GSE138867/sub_V045_F120", gene.column=1)
V054_F209 <- Read10X("~/Downloads/GSE138867/sub_V054_F209", gene.column=1)
  1. If you are using a Windows computer and unzipped the file in your Downloads folder, execute the following.
V035_F031 <- Read10X("~/../Downloads/GSE138867/sub_V035_F031", gene.column=1)
V039_F093 <- Read10X("~/../Downloads/GSE138867/sub_V039_F093", gene.column=1)
V040_F198 <- Read10X("~/../Downloads/GSE138867/sub_V040_F198", gene.column=1)
V041_F043 <- Read10X("~/../Downloads/GSE138867/sub_V041_F043", gene.column=1)
V043_F047 <- Read10X("~/../Downloads/GSE138867/sub_V043_F047", gene.column=1)
V044_F037 <- Read10X("~/../Downloads/GSE138867/sub_V044_F037", gene.column=1)
V045_F120 <- Read10X("~/../Downloads/GSE138867/sub_V045_F120", gene.column=1)
V054_F209 <- Read10X("~/../Downloads/GSE138867/sub_V054_F209", gene.column=1)

Create the Seurat objects

scV035_F031 <- CreateSeuratObject(counts=V035_F031, project="V035_F031")
scV039_F093 <- CreateSeuratObject(counts=V039_F093, project="V039_F093")
scV040_F198 <- CreateSeuratObject(counts=V040_F198, project="V040_F198")
scV041_F043 <- CreateSeuratObject(counts=V041_F043, project="V041_F043")
scV043_F047 <- CreateSeuratObject(counts=V043_F047, project="V043_F047")
scV044_F037 <- CreateSeuratObject(counts=V044_F037, project="V044_F037")
scV045_F120 <- CreateSeuratObject(counts=V045_F120, project="V045_F120")
scV054_F209 <- CreateSeuratObject(counts=V054_F209, project="V054_F209")

Combine the Seurat objects together to make one master data set. The add.cell.ids parameter specifies a prefix to add to the cell IDs when combining the data. This avoids situations where different cell barcodes match from different samples.

sc_data <- merge(scV035_F031, 
  y=c(scV039_F093,scV040_F198,scV041_F043,scV043_F047,
      scV044_F037,scV045_F120,scV054_F209),
  add.cell.ids=c("s1","s2","s3","s4","s5","s6","s7","s8"))

1.3.2 Cell QC steps (Start at this point in the workshop)

Read the prepared RDS file, which reflects all of the steps in the last sub-exercise.

Note that orig.ident is how we’re keeping track of which file came from which sample. Add group metadata to the Seurat object.

sc_data <- readRDS(url("https://wd.cri.uic.edu/scrna/GSE138867.rds"))

Join the counts layers across all samples.

sc_data <- JoinLayers(sc_data)

Count cells per sample.

head(sc_data@meta.data)
                      orig.ident nCount_RNA nFeature_RNA
s1_AAACGGGAGACTGTAA-1  V035_F031       5201         1237
s1_AAACGGGTCTCAAACG-1  V035_F031       3909         1169
s1_AAAGTAGGTAGCAAAT-1  V035_F031       5026         1335
s1_AAAGTAGGTCCTCCAT-1  V035_F031       9112         2393
s1_AAAGTAGTCCGCAAGC-1  V035_F031       4087         1259
s1_AAATGCCAGCGTTCCG-1  V035_F031       2716           88
table(sc_data$orig.ident)

V035_F031 V039_F093 V040_F198 V041_F043 V043_F047 V044_F037 V045_F120 V054_F209 
     1000      1000      1000      1000      1000      1000      1000      1000 
# add in sample group metadata
metadata <- read.delim("https://wd.cri.uic.edu/scrna/GSE138867_metadata.txt",row.names=1)
sc_data@meta.data$group <- metadata[sc_data@meta.data$orig.ident,"Group"]
head(sc_data@meta.data)
                      orig.ident nCount_RNA nFeature_RNA  group
s1_AAACGGGAGACTGTAA-1  V035_F031       5201         1237 Smoker
s1_AAACGGGTCTCAAACG-1  V035_F031       3909         1169 Smoker
s1_AAAGTAGGTAGCAAAT-1  V035_F031       5026         1335 Smoker
s1_AAAGTAGGTCCTCCAT-1  V035_F031       9112         2393 Smoker
s1_AAAGTAGTCCGCAAGC-1  V035_F031       4087         1259 Smoker
s1_AAATGCCAGCGTTCCG-1  V035_F031       2716           88 Smoker

Add in gene metadata (gene symbols).

genes_to_ids <- read.table("https://wd.cri.uic.edu/scrna/features.tsv",
  row.names=1, col.names=c("","gene.symbol"))
head(genes_to_ids)
                 gene.symbol
ENSG00000243485   MIR1302-10
ENSG00000237613      FAM138A
ENSG00000186092        OR4F5
ENSG00000238009 RP11-34P13.7
ENSG00000239945 RP11-34P13.8
ENSG00000237683   AL627309.1
sc_data[["RNA"]] <- AddMetaData(sc_data[["RNA"]],
  genes_to_ids[Features(sc_data),,drop=F])

1.3.3 Basic quality control checks

  • Check mitochondrial counts per cell: high levels of mitochondrial expression are an indication of a dying cell, as mitochrondrial contents spill into the cytoplasm.
  • Also check the distribution of UMI counts and genes expressed per cell.

Read in a list of the coordinates for each gene so, that we know which genes are on chrM

genes <- read.table("https://wd.cri.uic.edu/scrna/hg19_genes.bed")
dim(genes)
[1] 43319     6
# column 4 have the gene IDs, and column 1 the chromosome
head(genes)
     V1        V2        V3              V4 V5 V6
1  chrX  99883667  99894988 ENSG00000000003  0  -
2  chrX  99839799  99854882 ENSG00000000005  0  +
3 chr20  49551404  49575092 ENSG00000000419  0  -
4  chr1 169818772 169863408 ENSG00000000457  0  -
5  chr1 169631245 169823221 ENSG00000000460  0  +
6  chr1  27938575  27961788 ENSG00000000938  0  -

Make a list of the mitochondrial genes: all gene IDs on the mitochondrial chromosome.

mt.genes <- as.character(genes[genes[,1]=="chrM", 4])
length(mt.genes)
[1] 13

Compute percent mitochondrial expression per cell

sc_data[["percent.MT"]] <- PercentageFeatureSet(sc_data, features=mt.genes)

NOTE: There are a few ways to get a list of the mitochondrial gene IDs for your genome.

  • You can download BED files similar to this one from UCSC genome browser’s Table Browser https://genome.ucsc.edu/cgi-bin/hgTables, just make sure to get the one that matches the gene IDs in your file.
  • You can also download or request a copy of the GTF file used to quantify your expression data; GTF is a different format, but can also be used to get a list of gene IDs and the chromosomes they are on.
  • Finally, you can also ask your local bioinformatics core for assistance!

Violin plots of the distribution of counts, genes, and mitochondrial expression

VlnPlot(sc_data, features = c("nFeature_RNA","nCount_RNA","percent.MT"), pt.size=0.2)
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

Scatterplot of number of UMI counts vs % mitochondrial expression. Note that most cells with high levels of mitochrondrial expression have very low UMI counts.

FeatureScatter(sc_data, feature1="nCount_RNA", feature2="percent.MT")

1.3.4 Filtering cells for downstream analysis

Based on the violin plots and scatterplots, decide on a reasonable threshold for filtering out the “bad” cells.

  • We want to remove cells with very low expression, or high levels of mitochondrial expression, as those have less reliable signals or may be dying.

For today, our passing cells must have:

  • Number of genes (nFeature) per cell greater than 500
  • UMI counts (nCount) per cell greater than 2000
  • Less than 10% mitochondrial content
sc_subset <- subset(sc_data, 
  nFeature_RNA>500 & nCount_RNA>2000 & percent.MT < 10)
sc_subset
An object of class Seurat 
32738 features across 7392 samples within 1 assay 
Active assay: RNA (32738 features, 0 variable features)
 1 layer present: counts

1.3.5 Doublet analysis (TAKE HOME: Skip for now, try later at home. Jump to the next sub-section.)

NOTE: skip this part, try at home later. Skip to Doublet removal.

Run a doublet check with the SCDS library. We want to do this after the cell QC steps, but we should also analyze each sample separately.

library(scds)
library(SingleCellExperiment)
samples <- unique(sc_subset@meta.data$orig.ident)
doublets_df <- data.frame()
for(s in samples){
  # subset to this sample
  this_sc <- subset(sc_subset, orig.ident == s)
  # get counts
  this_counts <- GetAssayData(this_sc, layer="counts")
  # set up SingleCellExperiment object
  this_sce <- SingleCellExperiment(list(counts=this_counts))
  # run hybrid method from scds
  cxds <- cxds_bcds_hybrid(this_sce)
  # make data frame of results
  this_df <- data.frame("Barcode"=names(cxds$hybrid_score),
    "Score" = cxds$hybrid_score,
    "Sample" = s)
  doublets_df <- rbind(doublets_df, this_df)
}

1.3.6 Doublet removal

To save time, read in the results of the doublet analysis.

We will identify the top 5% of cells from each sample as doublets.

doublets_df <- read.delim("https://wd.cri.uic.edu/scrna/doublet_analysis.txt")
head(doublets_df)
                Barcode     Score    Sample
1 s1_AAACGGGAGACTGTAA-1 0.2275562 V035_F031
2 s1_AAACGGGTCTCAAACG-1 0.2172869 V035_F031
3 s1_AAAGTAGGTAGCAAAT-1 0.5191589 V035_F031
4 s1_AAAGTAGGTCCTCCAT-1 1.4636403 V035_F031
5 s1_AAAGTAGTCCGCAAGC-1 0.2578670 V035_F031
6 s1_AAATGCCTCTTGAGGT-1 0.4076141 V035_F031
library(dplyr)
doublet_cells <- doublets_df %>% 
  group_by(Sample) %>%
  slice_max(order_by=Score, prop=0.05)
# count how many cells we selected per sample
table(doublet_cells$Sample)

V035_F031 V039_F093 V040_F198 V041_F043 V043_F047 V044_F037 V045_F120 V054_F209 
       47        46        48        47        46        43        45        43 
# add a doublet column to the metadata and filter
sc_subset@meta.data$IsDoublet <- rownames(sc_subset@meta.data) %in% doublet_cells$Barcode
head(sc_subset@meta.data)
                      orig.ident nCount_RNA nFeature_RNA  group percent.MT
s1_AAACGGGAGACTGTAA-1  V035_F031       5201         1237 Smoker   3.653144
s1_AAACGGGTCTCAAACG-1  V035_F031       3909         1169 Smoker   3.223331
s1_AAAGTAGGTAGCAAAT-1  V035_F031       5026         1335 Smoker   1.929964
s1_AAAGTAGGTCCTCCAT-1  V035_F031       9112         2393 Smoker   3.435031
s1_AAAGTAGTCCGCAAGC-1  V035_F031       4087         1259 Smoker   4.061659
s1_AAATGCCTCTTGAGGT-1  V035_F031       5521         1292 Smoker   1.829379
                      IsDoublet
s1_AAACGGGAGACTGTAA-1     FALSE
s1_AAACGGGTCTCAAACG-1     FALSE
s1_AAAGTAGGTAGCAAAT-1     FALSE
s1_AAAGTAGGTCCTCCAT-1      TRUE
s1_AAAGTAGTCCGCAAGC-1     FALSE
s1_AAATGCCTCTTGAGGT-1     FALSE
sc_subset <- subset(sc_subset, IsDoublet == F)

1.3.7 Count the final fraction of cells remaining

Check what fraction of cells we kept from each sample, using the table() function

# counts before filtering
orig.counts <- table(sc_data$orig.ident)
# counts after filtering
subset.counts <- table(sc_subset$orig.ident)
# combine and compare counts
cell_stats <- cbind("Starting Cells" = orig.counts,
  "Retained Cells" = subset.counts,
  "Fraction" = subset.counts/orig.counts)
cell_stats
          Starting Cells Retained Cells Fraction
V035_F031           1000            894    0.894
V039_F093           1000            891    0.891
V040_F198           1000            923    0.923
V041_F043           1000            902    0.902
V043_F047           1000            879    0.879
V044_F037           1000            828    0.828
V045_F120           1000            874    0.874
V054_F209           1000            836    0.836

1.4 Gene feature selection and scaling

  • Normalize gene expression
  • Find variable genes
  • Scale (z-score) genes
  • Run PCA

If you need to catch up, read in the RDS file that is consistent with the QC steps above:

sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/postQC.rds"))
  1. Normalize the data
sc_subset <- NormalizeData(sc_subset)
Normalizing layer: counts
  1. Find variable genes. In practice, you may want to vary the number of features you select here based on the variable features plot.
sc_subset <- FindVariableFeatures(sc_subset, nfeatures=4000)
Finding variable features for layer counts
VariableFeaturePlot(sc_subset)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.

  1. Scale (z-score) the data. NOTE the ScaleData function will only scale the features that are identified as variable via the FindVariableFeatures() function.
sc_subset <- ScaleData(sc_subset)
Centering and scaling data matrix
  1. Run the PCA. npcs = 50 is the default. You may want to increase for more complex data sets. Also set “verbose=F” otherwise Seurat will print out a number of statistics to the console.
sc_subset <- RunPCA(sc_subset, npcs = 50, verbose=F)
  1. Generate the elbow (scree) plot of the PCA results.
ElbowPlot(sc_subset, ndims=50)

  1. Make a heatmap of signals from the top 24 PCs. To decide how many we want to keep we’re plotting the top 300 cells based on their weightings for each PC, with an even mix of positive and negative associations (balanced=T). We want to examine each PC’s heatmap to see at what point the “signal” starts to go away.

NOTE: If you get warnings about figure margins too large, try making the plot pane in your R studio bigger

DimHeatmap(sc_subset, dims=1:50, cells=300, balanced=T)

1.4.1 JackStraw (TAKE HOME)

DO NOT RUN THESE STEPS TODAY as they are computationally demanding. We will review the commands and results in the exercise handout.

JackStraw is a method for computing p-values for principal components. As implemented in Seurat, the steps are:

  • JackStraw() computes projected PCA scores for a random permutation of a subset of the data, and compares these values to the observed PCA scores for the genes. This allows the computation of a p-value for the association of each gene with each PC.
  • ScoreJackStraw() computes p-values for the PCs based on the distribution of p-values obtained for all genes against that PC from the JackStraw step.
  • JackStrawPlot() makes a diagnostic plot we can use to evaluate the signal level and significance of each PC.

NOTE: The selection of the number of PCs to use (dims parameter) is different for the different functions. For JackStraw() you just give a number (e.g., 50), and for the other two you give a range (e.g., 1:50).

sc_subset <- JackStraw(sc_subset, num.replicate = 100, dims = 50)
sc_subset <- ScoreJackStraw(sc_subset, dims = 1:50)
JackStrawPlot(sc_subset, dims = 1:50)

These are Q-Q plots: PC curves that are more diverged from the black dashed diagonal line have a higher level of signal. P-values are given in the legend.

  • You don’t necessarily want to set a simplistic p<0.05 threshold, as this may still retain a lot of PCs with minimal (but statistically significant) levels of signal.
    • A PC can be significant but still capture variability primarily from a tiny fraction of cells, and therefore be more likely to be noise.
  • What’s more helpful is to compare the exponent magnitude of the PCs as you go down the list.
    • PCs 1-21 all have p-values <1e-20.
    • PCs 22-24 have small p-values, but they’re getting bigger.
    • PC 25 has a large p-value (0.25).
    • In this case, it seems reasonable to limit the PCs to top 21-24.

1.5 Clustering

1.5.1 Run clustering on the top PCs to filter out noise

  • Based on the PCA visualizations, we’ll take the top 21 PCs, and cluster on those data.
  • Then visualizations and differential expression across clusters.
  1. First run FindNeighbors(). For this exercise we will use the first 21 PCs to determine the distance between the cells and thus who are the “neighbors”
pca.dims <- 1:21
sc_subset <- FindNeighbors(sc_subset, dims=pca.dims)
Computing nearest neighbor graph
Computing SNN
  1. Find cell clusters with a resolution 0.5. The larger the number the finer the resolution. Optimal cluster resolution will likely have to be determined empirically.
sc_subset <- FindClusters(sc_subset, resolution=0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 7027
Number of edges: 255224

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8817
Number of communities: 12
Elapsed time: 0 seconds
# see where clusters are stored
head(sc_subset@meta.data)
                      orig.ident nCount_RNA nFeature_RNA  group percent.MT
s1_AAACGGGAGACTGTAA-1  V035_F031       5201         1237 Smoker   3.653144
s1_AAACGGGTCTCAAACG-1  V035_F031       3909         1169 Smoker   3.223331
s1_AAAGTAGGTAGCAAAT-1  V035_F031       5026         1335 Smoker   1.929964
s1_AAAGTAGTCCGCAAGC-1  V035_F031       4087         1259 Smoker   4.061659
s1_AAATGCCTCTTGAGGT-1  V035_F031       5521         1292 Smoker   1.829379
s1_AACACGTCAACTGCTA-1  V035_F031       3614          804 Smoker   2.102933
                      IsDoublet RNA_snn_res.0.5 seurat_clusters
s1_AAACGGGAGACTGTAA-1     FALSE              10              10
s1_AAACGGGTCTCAAACG-1     FALSE               2               2
s1_AAAGTAGGTAGCAAAT-1     FALSE               0               0
s1_AAAGTAGTCCGCAAGC-1     FALSE               3               3
s1_AAATGCCTCTTGAGGT-1     FALSE               3               3
s1_AACACGTCAACTGCTA-1     FALSE               0               0
  1. Generate UMAP plot.
sc_subset <- RunUMAP(sc_subset, dims=pca.dims)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
13:55:16 UMAP embedding parameters a = 0.9922 b = 1.112
13:55:16 Read 7027 rows and found 21 numeric columns
13:55:16 Using Annoy for neighbor search, n_neighbors = 30
13:55:16 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
13:55:16 Writing NN index file to temp file /tmp/Rtmpz7m5F6/file2a7a1d51e1f1d8
13:55:16 Searching Annoy index using 1 thread, search_k = 3000
13:55:18 Annoy recall = 100%
13:55:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
13:55:19 Initializing from normalized Laplacian + noise (using RSpectra)
13:55:19 Commencing optimization for 500 epochs, with 289920 positive edges
13:55:26 Optimization finished
# see where UMAP coordinates are stored
head(sc_subset@reductions$umap@cell.embeddings)
                           umap_1    umap_2
s1_AAACGGGAGACTGTAA-1 -11.9474511  6.451997
s1_AAACGGGTCTCAAACG-1   2.0845266 -3.441252
s1_AAAGTAGGTAGCAAAT-1   0.1690202  2.425961
s1_AAAGTAGTCCGCAAGC-1   0.5361578 -2.287897
s1_AAATGCCTCTTGAGGT-1   3.4209168 -1.337700
s1_AACACGTCAACTGCTA-1   4.7771568  1.230965
DimPlot(sc_subset, reduction='umap', label=T)

Save our work (Recommended!)

saveRDS(sc_subset,"clustered.rds")

2 Afternoon

If you closed R, then reload the sc_subset Seurat object. Otherwise skip this step.

library(Seurat)
sc_subset <- readRDS("clustered.rds")

If you closed R and forgot to save your Seurat object, you can read it from our server. Otherwise skip this step. NOTE Use the url() function when reading an RDS from the web.

library(Seurat)
sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/clustered.rds"))

2.1 Check for cell cycle and batch biases

2.1.1 Cell cycle analysis

Look at cell cycle first. The Seurat library also loads genes associated with cell cycle. We will use those, though you may want to consider reviewing the list when you run an analysis.

We will use one of the features.tsv files from CellRanger to map Ensembl IDs to gene symbols. If you added the gene symbols to the Seurat object’s [[RNA]] meta.data earlier you can also pull this information from there.

# load gene symbols for cell cycle
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
head(s.genes)
[1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4"
head(g2m.genes)
[1] "HMGB2"  "CDK1"   "NUSAP1" "UBE2C"  "BIRC5"  "TPX2"  

Get ID to gene symbol conversion table.

genes_to_ids <- data.frame(Features(sc_subset),
  sc_subset[["RNA"]]@meta.data["gene.symbol"])
## or ##
genes_to_ids <- read.table("https://wd.cri.uic.edu/scrna/features.tsv")

Then map gene symbols to Ensembl IDs.

s.genes_ids <- merge(s.genes, genes_to_ids, by.x=1, by.y=2)
g2m.genes_ids <- merge(g2m.genes, genes_to_ids, by.x=1, by.y=2)
head(s.genes_ids)
      x              V1
1 ATAD2 ENSG00000156802
2   BLM ENSG00000197299
3 BRIP1 ENSG00000136492
4 CCNE2 ENSG00000175305
5 CDC45 ENSG00000093009
6  CDC6 ENSG00000094804
# cell cycle scoring
sc_subset <- CellCycleScoring(sc_subset,
  s.features=s.genes_ids[,2], g2m.features=g2m.genes_ids[,2])
head(sc_subset@meta.data)
                      orig.ident nCount_RNA nFeature_RNA  group percent.MT
s1_AAACGGGAGACTGTAA-1  V035_F031       5201         1237 Smoker   3.653144
s1_AAACGGGTCTCAAACG-1  V035_F031       3909         1169 Smoker   3.223331
s1_AAAGTAGGTAGCAAAT-1  V035_F031       5026         1335 Smoker   1.929964
s1_AAAGTAGTCCGCAAGC-1  V035_F031       4087         1259 Smoker   4.061659
s1_AAATGCCTCTTGAGGT-1  V035_F031       5521         1292 Smoker   1.829379
s1_AACACGTCAACTGCTA-1  V035_F031       3614          804 Smoker   2.102933
                      IsDoublet RNA_snn_res.0.5 seurat_clusters     S.Score
s1_AAACGGGAGACTGTAA-1     FALSE              10              10 -0.04733959
s1_AAACGGGTCTCAAACG-1     FALSE               2               2 -0.06362626
s1_AAAGTAGGTAGCAAAT-1     FALSE               0               0  0.01897592
s1_AAAGTAGTCCGCAAGC-1     FALSE               3               3 -0.02388267
s1_AAATGCCTCTTGAGGT-1     FALSE               3               3  0.02537982
s1_AACACGTCAACTGCTA-1     FALSE               0               0 -0.03449400
                        G2M.Score Phase
s1_AAACGGGAGACTGTAA-1 -0.02549001    G1
s1_AAACGGGTCTCAAACG-1 -0.06068320    G1
s1_AAAGTAGGTAGCAAAT-1 -0.05418265     S
s1_AAAGTAGTCCGCAAGC-1 -0.04009162    G1
s1_AAATGCCTCTTGAGGT-1 -0.02018164     S
s1_AACACGTCAACTGCTA-1 -0.03611933    G1
# visualize cell cycle in UMAP
DimPlot(sc_subset, group.by="Phase")

# visualize scores versus clusters in boxplot
library(ggplot2)
ggplot(sc_subset@meta.data, aes(x=RNA_snn_res.0.5, y=S.Score)) +
  geom_boxplot()

ggplot(sc_subset@meta.data, aes(x=RNA_snn_res.0.5, y=G2M.Score)) +
  geom_boxplot()

In this case, it does not look like we had cell cycle bias.

If we did, and we decided it was important to remove it, we would go back to the ScaleData step to correct for it (don’t run this now):

sc_subset2 <- ScaleData(sc_subset, vars.to.regress = c("S.Score", "G2M.Score"))

Then rerun PCA and clustering steps, and re-evaluate after clustering the data again.

2.1.2 Batch effect analysis

Evaluate batch effect by looking at each sample on the UMAP plot.

# UMAP plot by cluster and sample
DimPlot(sc_subset, group.by="seurat_clusters",split.by="orig.ident")

# Number of cells per sample per cluster
table(sc_subset@meta.data[,c("orig.ident","seurat_clusters")])
           seurat_clusters
orig.ident    0   1   2   3   4   5   6   7   8   9  10  11
  V035_F031 323 185  83 124  16  12  23  31  19  43  35   0
  V039_F093 222 172  65  13   1   1 170 129  58  13  40   7
  V040_F198  17  21 219  38   5 482  16  24  24  27  44   6
  V041_F043  34   6  65 163 339   5 108  50  65  57   9   1
  V043_F047 288 190 111  77  32   2  38  29  61  26  23   2
  V044_F037   7   8 172 238 182   4  41  67  52  17  30  10
  V045_F120 313  80 114  93   4   5 106  63  59   8  26   3
  V054_F209  55 424 112  44   1   5   8  28  76  56  21   6

There are some sample-specific differences, but the samples are generally occupying the same spaces in the UMAP plot, so this is probably not an issue.

If we did want to correct for batch effect, these are the steps (don’t run this now):

# split object based on the orig.ident, or 
# whatever batch factor is appropriate
sc_split <- SplitObject(sc_subset, split.by="orig.ident")
# run normalization and variant feature selection separately for each sample
sc_split <- lapply(sc_split, function(x){
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x)})
# find integration features
integration_genes <- SelectIntegrationFeatures(sc_split)
# run integration
integration_anchors <- FindIntegrationAnchors(sc_split, anchor.features=integration_genes)
sc_integrated <- IntegrateData(anchorset = integration_anchors)
# this creates a NEW assay, "integrated"
# use this instead of the "RNA" assay
DefaultAssay(sc_integrated) <- "integrated"

At this point you would rerun the analysis from the ScaleData step, and re-evalute the samples on the new clustering results and UMAP plot. Note though that the integrated assay layer ONLY has the variable genes that were identified before; it does not have the full transcriptome.

2.1.3 Comparison of clustering results

Refer to take-home exercise 3.1 for examples in generating different clustering results and comparing those for similarities and differences.

2.2 Putative cell type identification

We will try two strategies:

  1. Look at marker expression per cluster.
  2. Look at gene expression relative to a “gold standard” cell type data set.

Read in the clustering results if necessary:

sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/clustered.rds"))

2.2.1 Marker expression

We have prepared a small list of marker genes and their cell types for common immune cell types. In practice, you would want to derive your own cell marker list based on the types of cells you anticipate finding in your data set, the level of specificity you’re looking for in defining cell types (e.g., generic T cell, vs T helper, T reg, T responder, etc.), and what markers you consider to be specific and informative for those cell types.

  1. Read in the marker list
markers <- read.delim("https://wd.cri.uic.edu/scrna/blood_markers.txt")
markers
   Marker Cell.Type
1   CD79A   B cells
2   MS4A1   B cells
3  BCL11A   B cells
4  IFITM3 Monocytes
5    PSAP Monocytes
6    CST3 Monocytes
7     LYZ Monocytes
8    GNLY  NK cells
9  FCGR3A  NK cells
10   CD3D   T cells
11   CD3E   T cells
12   CD3G   T cells
13   SELL   T cells
  1. Convert between ensembl IDs and gene symbols using a named vector

Get ID to gene symbol conversion table.

genenames <- data.frame(Features(sc_subset),
  sc_subset[["RNA"]]@meta.data["gene.symbol"])
## or
genenames <- read.table("https://wd.cri.uic.edu/scrna/features.tsv")

Make named vector for marker genes to symbols.

gene_to_id <- genenames[,1]
names(gene_to_id) <- make.unique(genenames[,2])
markers$ID <- gene_to_id[markers$Marker]
markers
   Marker Cell.Type              ID
1   CD79A   B cells ENSG00000105369
2   MS4A1   B cells ENSG00000156738
3  BCL11A   B cells ENSG00000119866
4  IFITM3 Monocytes ENSG00000142089
5    PSAP Monocytes ENSG00000197746
6    CST3 Monocytes ENSG00000101439
7     LYZ Monocytes ENSG00000090382
8    GNLY  NK cells ENSG00000115523
9  FCGR3A  NK cells ENSG00000203747
10   CD3D   T cells ENSG00000167286
11   CD3E   T cells ENSG00000198851
12   CD3G   T cells ENSG00000160654
13   SELL   T cells ENSG00000188404

Use DotPlot from Seurat to look at expression levels per cluster.

# get unique list of genes
DotPlot(sc_subset, features=markers$ID, group.by="RNA_snn_res.0.5" ) + RotatedAxis()

The DotPlot would be more useful if we could label the x-axis with gene symbols and cell type descriptions. Let’s do that.

# format the marker names as "gene: cell type"
#  - this will make a named vector
#  - the values are the labels to plot
#  - the names are the gene IDs
markers_description = paste0(markers$Marker,": ", markers$Cell.Type)
names(markers_description) = markers$ID
markers_description
    ENSG00000105369     ENSG00000156738     ENSG00000119866     ENSG00000142089 
   "CD79A: B cells"    "MS4A1: B cells"   "BCL11A: B cells" "IFITM3: Monocytes" 
    ENSG00000197746     ENSG00000101439     ENSG00000090382     ENSG00000115523 
  "PSAP: Monocytes"   "CST3: Monocytes"    "LYZ: Monocytes"    "GNLY: NK cells" 
    ENSG00000203747     ENSG00000167286     ENSG00000198851     ENSG00000160654 
 "FCGR3A: NK cells"     "CD3D: T cells"     "CD3E: T cells"     "CD3G: T cells" 
    ENSG00000188404 
    "SELL: T cells" 
# Add custom x labels to our plot
library(ggplot2)
DotPlot(sc_subset, features=markers$ID, group.by="RNA_snn_res.0.5" ) + 
  RotatedAxis() +
  scale_x_discrete(labels=function(x) markers_description)

You can evaluate the markers manually to make an assessment of the cell type per cluster. For more examples of visualizations of these markers, and an automated approach for assigning cell types, look at take-home exercise 3.2.

2.2.2 Label transfer from gold standard

We need an external single-cell data set whose cell types we trust. This is typically from published data sets, and we need to find a source where the authors have published not only the gene expression matrix files (as required), but also the cell type metadata (not always included).

Then you would load these files into a Seurat object, run normalization as above, and add the cell types to the single-cell metadata.

For this exercise, we are providing data from:

Single-cell RNA-Seq analysis reveals cell subsets and gene signatures associated with rheumatoid arthritis disease activity, Binvignat et al., JCI Insight, 2024 Aug 22; 9(16): e178499.

# read in Seurat object with data
sc_ref <- readRDS(url("https://wd.cri.uic.edu/scrna/sc_reference.rds"))
# run normalization and find variable features
sc_ref <- NormalizeData(sc_ref)
sc_ref <- FindVariableFeatures(sc_ref, nfeatures=4000)
# confirm that the gene identifiers match, and subset to genes in common
head(rownames(sc_ref))
[1] "ENSG00000186092" "ENSG00000238009" "ENSG00000239945" "ENSG00000241599"
[5] "ENSG00000229905" "ENSG00000237491"
head(rownames(sc_subset))
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000237683"
common.genes <- intersect(rownames(sc_ref),rownames(sc_subset))
length(common.genes)
[1] 20667
# cell types we can transfer
table(sc_ref@meta.data$rough_annot)

   Bcells CD4Tcells CD8Tcells Monocytes   NKcells 
     5776     21333      3142      9497      3846 
table(sc_ref@meta.data$fine_annot)

   CD4 T central memory   CD4 T effector memory              CD4 T IFIT 
                   8739                    5049                     239 
            CD4 T Naive              yd T cells         CD8 T early Tem 
                   6724                     582                     915 
            CD8 T Naive               CD8 TEMRA     Classical Monocytes 
                   1164                    1063                    3653 
       IFITM3 Monocytes          IL1b-Monocytes             Myeloid DCs 
                    246                    2280                    2429 
Non-classical Monocytes            NKCD56bright               NKCD56low 
                    889                    3039                     807 
          Memory Bcells            Naive Bcells            Plasmablasts 
                   2462                    2739                     575 

We will use the rough_annot for now, but you can repeat the same process with fine_annot if you want.

# run the label transfer (transfer anchors) with Seurat
anchors <- FindTransferAnchors(reference=sc_ref, query=sc_subset, dims=1:30)
Performing PCA on the provided reference using 3776 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
    Found 5948 anchors
predictions <- TransferData(anchorset=anchors, refdata=sc_ref@meta.data$rough_annot)
Finding integration vectors
Finding integration vector weights
Predicting cell labels
# compute the majority predicted cell type per cluster
predictions$Cluster <- sc_subset@meta.data$RNA_snn_res.0.5
prediction_summary <- table(predictions[,c("Cluster","predicted.id")])
prediction_summary
       predicted.id
Cluster Bcells CD4Tcells CD8Tcells Monocytes NKcells
     0       0      1255         4         0       0
     1       0       941       145         0       0
     2       0        13       855         0      73
     3       0       696        94         0       0
     4       0       579         1         0       0
     5       1       495        20         0       0
     6       0       506         4         0       0
     7       0         0         0       421       0
     8       0         1         0         0     413
     9     247         0         0         0       0
     10    228         0         0         0       0
     11      0         0         0        35       0
prediction_type <- apply(prediction_summary,1,function(x){
  colnames(prediction_summary)[which.max(x)]})
prediction_type
          0           1           2           3           4           5 
"CD4Tcells" "CD4Tcells" "CD8Tcells" "CD4Tcells" "CD4Tcells" "CD4Tcells" 
          6           7           8           9          10          11 
"CD4Tcells" "Monocytes"   "NKcells"    "Bcells"    "Bcells" "Monocytes" 
# assign these cell types in the Seurat object
sc_subset@meta.data$CellType <- prediction_type[sc_subset@meta.data$RNA_snn_res.0.5]
DimPlot(sc_subset, group.by = "CellType")

# compare to the clusters
DimPlot(sc_subset, reduction='umap', label=T)

2.3 Differential statistics

2.3.1 Differential expression between clusters

Run differential expression between clusters using the Wilcox test, running comparisons as one cluster vs all other cells. Only test genes that are expressed in at least 20% of cells (default is 1%).

degs <- FindAllMarkers(sc_subset, test.use="wilcox", min.pct=0.2)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
head(degs)
                        p_val avg_log2FC pct.1 pct.2     p_val_adj cluster
ENSG00000145425 2.836137e-250  0.5026214 1.000 1.000 9.284946e-246       0
ENSG00000196154 1.550552e-233 -2.3296592 0.604 0.872 5.076198e-229       0
ENSG00000122026 6.752797e-233  0.4360492 1.000 1.000 2.210731e-228       0
ENSG00000166710 1.552457e-229 -0.6831864 1.000 1.000 5.082434e-225       0
ENSG00000144713 2.897885e-217  0.4282546 1.000 1.000 9.487095e-213       0
ENSG00000163682 2.439887e-196  0.4691579 1.000 1.000 7.987701e-192       0
                           gene
ENSG00000145425 ENSG00000145425
ENSG00000196154 ENSG00000196154
ENSG00000122026 ENSG00000122026
ENSG00000166710 ENSG00000166710
ENSG00000144713 ENSG00000144713
ENSG00000163682 ENSG00000163682

In this table

  • p_val is from Wilcox test
  • avg_log2FC is the log2 fold-change between cells in this cluster and all other cells
  • pct.1 is the fraction (out of 1) of cells in this cluster that express the gene
  • pct.2 is the fraction of all other cells that express the gene
  • p_val_adj is the adjusted p-value from Bonferroni (see note below)
  • cluster is the cluster in question; the other clusters will appear farther down this table
  • gene is the gene in question

NOTE on p_val_adj: These p-values are biased, as the clustering analysis intentionally divides cells based on differences in their transcriptome. They are also biased because this function pre-filters genes to test based on a minimum log2FC. Therefore, we fully expect to identify a large number of “significant” differences by design. It’s hard to correctly adjust for the expected distribution of p-values in this circumstance, so Seurat uses the overly conservative Bonferroni correction to try to over-adjust in a very rough sense.

Add in gene symbols, which we can find in any of the features files from CellRanger, or in the Seurat object’s RNA feature metadata (if you included it).

Get ID to gene symbol conversion table, this time using the IDs as rownames.

genenames <- sc_subset[["RNA"]]@meta.data["gene.symbol"]
rownames(genenames) = Features(sc_subset)
## or
genenames <- read.table("https://wd.cri.uic.edu/scrna/features.tsv",row.names=1)

Add gene symbols to DEGs table.

degs$symbol <- genenames[degs$gene, 1]
head(degs)
                        p_val avg_log2FC pct.1 pct.2     p_val_adj cluster
ENSG00000145425 2.836137e-250  0.5026214 1.000 1.000 9.284946e-246       0
ENSG00000196154 1.550552e-233 -2.3296592 0.604 0.872 5.076198e-229       0
ENSG00000122026 6.752797e-233  0.4360492 1.000 1.000 2.210731e-228       0
ENSG00000166710 1.552457e-229 -0.6831864 1.000 1.000 5.082434e-225       0
ENSG00000144713 2.897885e-217  0.4282546 1.000 1.000 9.487095e-213       0
ENSG00000163682 2.439887e-196  0.4691579 1.000 1.000 7.987701e-192       0
                           gene symbol
ENSG00000145425 ENSG00000145425  RPS3A
ENSG00000196154 ENSG00000196154 S100A4
ENSG00000122026 ENSG00000122026  RPL21
ENSG00000166710 ENSG00000166710    B2M
ENSG00000144713 ENSG00000144713  RPL32
ENSG00000163682 ENSG00000163682   RPL9

Count the number of up-regulated genes per cluster (p_val_adj < 0.05 and at least two-fold up-regulated).

table(degs[degs$p_val_adj<0.05 & degs$avg_log2FC>1, "cluster"])

  0   1   2   3   4   5   6   7   8   9  10  11 
  7  11  84   6  21  48  23 705 359 155 202 603 

Get top 5 up-regulated genes per cluster. The group_by() function will group the data.frame rows by the cluster column and the slice_max() function will return the top n (10) rows ranked by the column avg_log2FC.

library(dplyr)
top5 <- degs %>% group_by(cluster) %>% slice_max(n=5, order_by=avg_log2FC)
top5
# A tibble: 60 × 8
# Groups:   cluster [12]
       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene            symbol    
       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>           <chr>     
 1 1.46e-119       1.83 0.4   0.133 4.79e-115 0       ENSG00000189283 FHIT      
 2 3.85e- 62       1.64 0.258 0.092 1.26e- 57 0       ENSG00000182463 TSHZ2     
 3 1.65e- 50       1.41 0.303 0.139 5.40e- 46 0       ENSG00000179862 CITED4    
 4 3.02e- 27       1.18 0.225 0.117 9.90e- 23 0       ENSG00000134352 IL6ST     
 5 2.12e- 32       1.15 0.268 0.141 6.95e- 28 0       ENSG00000186854 TRABD2A   
 6 6.09e-260       2.92 0.52  0.108 1.99e-255 1       ENSG00000160789 LMNA      
 7 1.65e-155       1.80 0.516 0.16  5.40e-151 1       ENSG00000129824 RPS4Y1    
 8 1.49e- 96       1.30 0.643 0.348 4.87e- 92 1       ENSG00000118503 TNFAIP3   
 9 7.02e- 43       1.30 0.323 0.153 2.30e- 38 1       ENSG00000273319 RP11-138A…
10 9.97e- 33       1.20 0.324 0.175 3.26e- 28 1       ENSG00000090104 RGS1      
# ℹ 50 more rows

Quick visualizations:

  • Visualizations for top genes in cluster 5 in side-by-side violin plot
top5_cluster5 <- top5[top5$cluster==5,]
VlnPlot(sc_subset, features=top5_cluster5$gene, pt.size=F)

  • Heatmap of top 5 genes for all clusters

NOTE! By default, Seurat will only scale the variable features identified by FindVariableFeatures().

DoHeatmap(sc_subset, features=top5$gene)

See take-home exercise 3.3 for more examples of visualizations and differential expression between clusters, including using gene symbols in the plots.

2.3.2 Differential expression between samples, single cell level

  • First subset the cells to the cluster of interest.
  • Then set the “Idents” to the appropriate metadata column.
  • Set logfc.threshold = 0 to test ALL genes with sufficient minimum expression. This avoids biasing the p-values by only testing genes with a minimum difference. Since we are testing within a cluster, the biases in transcriptome due to clustering are not an issue.
  • Then we can use the FDR correction instead of the more conservative Bonferroni correction that is the default in FindMarkers and FindAllMarkers.

NOTE: use a loop in R to perform this analysis over all clusters.

# example analyzing for cluster 1
sc_cluster1 = subset(sc_subset, subset = seurat_clusters == 1)
Idents(sc_cluster1) = "group"
table(Idents(sc_cluster1))

    Smoker Non-Smoker 
       638        448 
cluster1_stats <- FindMarkers(sc_cluster1, test.use="wilcox",
  ident.1="Smoker", ident.2="Non-Smoker", logfc.threshold=0)
cluster1_stats$p_val_adj <- p.adjust(cluster1_stats$p_val, method="fdr")
cluster1_stats$symbol <- genenames[rownames(cluster1_stats), 1]
head(cluster1_stats)
                       p_val avg_log2FC pct.1 pct.2    p_val_adj symbol
ENSG00000197728 6.293938e-42  0.5843158 1.000 0.984 6.271909e-38  RPS26
ENSG00000234745 1.665232e-35 -0.4939781 1.000 1.000 8.297017e-32  HLA-B
ENSG00000124614 2.406454e-22 -0.3337910 0.998 1.000 7.993437e-19  RPS10
ENSG00000170315 5.136330e-22 -0.5637031 0.958 0.984 1.279588e-18    UBB
ENSG00000244734 2.882615e-20  4.8212828 0.251 0.040 5.745051e-17    HBB
ENSG00000100097 6.073084e-20  1.2267216 0.618 0.364 1.008638e-16 LGALS1
# count number of significant changes
table(cluster1_stats$p_val_adj < 0.05)

FALSE  TRUE 
 9817   148 
# heatmap visualization
degs1 <- rownames(subset(cluster1_stats, p_val_adj<0.05))
norm <- GetAssayData(sc_cluster1, layer="data")
norm <- as.matrix(norm[degs1,])
norm <- t(scale(t(norm)))
library(ComplexHeatmap)
library(circlize)
col_annot <- HeatmapAnnotation(df=sc_cluster1@meta.data["group"])
col_fun <- colorRamp2(c(-2,0,2),c("blue","white","red"))
Heatmap(norm, name="Z-score", top_annotation=col_annot, col=col_fun,
  show_column_names=F, row_labels=genenames[rownames(norm),1])

Note: if you save the heatmap to a pdf with height=20 then the gene names will be legible.

See take-home exercise 3.3 for an example of a pseudo-bulk analysis for the same cluster.

2.3.3 Differential cell type abundance

Count the cells per group per cluster, and analyze in edgeR.

Skip the TMM calculation so that the normalized values can be expressed as percent.

cell_counts <- table(sc_subset@meta.data[,c("seurat_clusters","orig.ident")])
# obtain metadata and match column order
metadata <- read.delim("https://wd.cri.uic.edu/scrna/GSE138867_metadata.txt",row.names=1)
metadata <- metadata[colnames(cell_counts),]
# analysis in edgeR
library(edgeR)
Loading required package: limma
counts_dge <- DGEList(cell_counts, group=metadata)
counts_dge <- estimateDisp(counts_dge)
Using classic mode.
# check the BCV: it is almost always much higher for cell type abundance comparisons
sqrt(counts_dge$common.disp)
[1] 1.001724
counts_results <- exactTest(counts_dge)$table
counts_results$QValue <- p.adjust(counts_results$PValue)
counts_results
        logFC   logCPM      PValue     QValue
0  -1.0981006 17.45593 0.359683196 1.00000000
1   0.5565322 17.27211 0.666302586 1.00000000
2   0.7411713 17.05163 0.291838428 1.00000000
3   0.4165837 16.81623 0.667464155 1.00000000
4  -0.7736547 16.36955 0.661947851 1.00000000
5   5.1668568 16.13844 0.002807115 0.03368538
6  -2.2121969 16.17710 0.008722238 0.09594462
7  -0.8012039 15.92089 0.305316762 1.00000000
8  -0.4479417 15.90518 0.538818524 1.00000000
9   0.4979121 15.18609 0.541651722 1.00000000
10  0.4187248 15.07397 0.553495935 1.00000000
11  0.7830817 12.83111 0.459477230 1.00000000

It looks like clusters 5 and 6 show fairly significant differences (under a more lenient Q<0.1) in relative abundance between groups (note that cluster 0 has >2-fold change, but is not significant beceause of the high variability). Make a boxplot of the significant differences.

percent <- cpm(counts_dge)
# convert to percent (divide CPM by a million)
percent <- percent/1000000
# look at significant changes, add group info
percent_signif <- data.frame(t(percent[counts_results$QValue<0.1,]),
  group=metadata,check.names=F)
percent_signif
                    5           6      group
V035_F031 0.013422819 0.025727069     Smoker
V039_F093 0.001122334 0.190796857 Non-Smoker
V040_F198 0.522210184 0.017334778     Smoker
V041_F043 0.005543237 0.119733925 Non-Smoker
V043_F047 0.002275313 0.043230944 Non-Smoker
V044_F037 0.004830918 0.049516908     Smoker
V045_F120 0.005720824 0.121281465 Non-Smoker
V054_F209 0.005980861 0.009569378     Smoker
# make boxplot
library(ggplot2)
library(tidyr)

Attaching package: 'tidyr'
The following objects are masked from 'package:Matrix':

    expand, pack, unpack
percent_signif_long <- pivot_longer(percent_signif,
  cols=-group, names_to="cluster", values_to="abundance")
percent_signif_long
# A tibble: 16 × 3
   group      cluster abundance
   <chr>      <chr>       <dbl>
 1 Smoker     5         0.0134 
 2 Smoker     6         0.0257 
 3 Non-Smoker 5         0.00112
 4 Non-Smoker 6         0.191  
 5 Smoker     5         0.522  
 6 Smoker     6         0.0173 
 7 Non-Smoker 5         0.00554
 8 Non-Smoker 6         0.120  
 9 Non-Smoker 5         0.00228
10 Non-Smoker 6         0.0432 
11 Smoker     5         0.00483
12 Smoker     6         0.0495 
13 Non-Smoker 5         0.00572
14 Non-Smoker 6         0.121  
15 Smoker     5         0.00598
16 Smoker     6         0.00957
ggplot(percent_signif_long, aes(x=cluster, y=abundance, fill=group)) +
  geom_boxplot()

2.4 Incorporating Feature Barcoding (FBC) Data

2.4.1 Loading 10X data into Seurat (INSTRUCTOR DEMONSTRATION)

NOTE: DO NOT PERFORM THE FOLLOWING COMMANDS

This is an example of the R commands to load 10X data with both gene expression (GEX) and feature barcoding (FBC) data into Seurat. For this workshop we will be providing a loaded, filtered, and clustered R data object (.rds file). So, skip ahead to Basic visualization of FBC data for the actual exercise.

  1. Load the matrix files using the standard Read10X function.
data_w_fbc <- Read10X("Sample_A_matrix/", gene.column=1)

NOTE: When the data are loaded by Read10X, it will create a R list with elements for both the gene expression (GEX) data and the feature barcoding (FBC) data. You can use the following command to view the elements in this list. This is not necessary if you already know the type of FBC data included.

names(data_w_fbc)
  1. Load the Gene Expression (GEX) data into a new Seurat object. The GEX data will be stored in the “RNA” assay in the Seurat object.
sc_w_fbc <- CreateSeuratObject(counts = data_w_fbc[['Gene Expression']])
  1. Add the FBC data as an additional assay in the Seurat object. For this example we will load into the “Protein” assay. NOTE: The FBC data does not need to loaded into an assay named “Protein”. The name of this assay can be anything (except RNA) that makes sense for your data.
sc_w_fbc[["Protein"]] <- CreateAssayObject(counts = data_w_fbc[["Antibody Capture"]])

2.4.2 Basic visualization of FBC data

For this exercise we will be providing a loaded, filtered, and clustered R data object. The original data are an example dataset from 10X (https://www.10xgenomics.com/resources/datasets/human-pbmc-from-a-healthy-donor-1-k-cells-v-2-2-standard-4-0-0).

For this Seurat object we had performed the following.

  • Load the 10X GEX with FBC data into Seurat using the commands listed above. FBC data were loaded in to the “Protein” assay.
  • Filter the cells using the thresholds nFeature_RNA > 1000 & nCount_RNA > 3000 & percent.MT < 10.
  • Normalized the data using NormalizeData().
  • Selected the top 4000 features using FindVariableFeatures().
  • Computed PCA object with up to 50 dimensions using RunPCA().
  • Executed FindNeighbors() using the first 10 PCA dimensions.
  • Computed clusters with FindClusters() with a resolution of 1.
  1. Load the pre-made RDS file.
sc_w_fbc <- readRDS(url("https://wd.cri.uic.edu/scrna/sc_fbc.rds"))
  1. Generate basic UMAP plot.
DimPlot(sc_w_fbc, reduction='umap', label=T)

3. List the features included in the FBC (Protein) assay.

row.names(sc_w_fbc[["Protein"]])
 [1] "CD3"          "CD19"         "CD45RA"       "CD4"          "CD8a"        
 [6] "CD14"         "CD16"         "CD56"         "CD25"         "CD45RO"      
[11] "PD-1"         "TIGIT"        "IgG1"         "IgG2a"        "IgG2b"       
[16] "CD127"        "CD15"         "CD197-(CCR7)" "HLA-DR"      

. Generate violin plots including FBC data and grouped by cluster.

VlnPlot(sc_w_fbc, 
    features = c("nFeature_RNA","nCount_RNA", "nCount_Protein"), 
    pt.size=0.2)

5. Normalize the FBC data using a center log ratio (CLR) normalization.

sc_w_fbc <- NormalizeData(sc_w_fbc, assay="Protein", normalization.method="CLR")
Normalizing across features
  1. Select “Protein” as the default assay for Feature and Dot plots
DefaultAssay(sc_w_fbc)
[1] "RNA"
DefaultAssay(sc_w_fbc) <- "Protein"
DefaultAssay(sc_w_fbc)
[1] "Protein"
  1. Create UMAP plots colored by FBC levels
    1. First get a list of the antibody features.
row.names(sc_w_fbc[["Protein"]])
 [1] "CD3"          "CD19"         "CD45RA"       "CD4"          "CD8a"        
 [6] "CD14"         "CD16"         "CD56"         "CD25"         "CD45RO"      
[11] "PD-1"         "TIGIT"        "IgG1"         "IgG2a"        "IgG2b"       
[16] "CD127"        "CD15"         "CD197-(CCR7)" "HLA-DR"      
  1. Use the following four antibody tags for the plots
sel_ab <- c("CD3", "CD4", "CD8a", "IgG1")
  1. Generate the UMAP plot colored by the selected antibody tags
FeaturePlot(sc_w_fbc, features=sel_ab, label=T)

8. Create dot plot of FBC features

DotPlot(sc_w_fbc, features=sel_ab)

  1. Create violin plot of features
VlnPlot(sc_w_fbc, features=sel_ab)

  1. Create heatmap of features
    1. First need to scale the data for the heatmap
    sc_w_fbc <- ScaleData(sc_w_fbc, assay="Protein") 
    Centering and scaling data matrix
    1. Create a heatmap of the selected antibody tags
    DoHeatmap(sc_w_fbc, features=sel_ab)
    1. Create a heatmap of all antibody tags
    DoHeatmap(sc_w_fbc, features=row.names(sc_w_fbc[["Protein"]]))

2.4.2.1 Plot both RNA (gene expression) and FBC features together

  1. Select the FBC (protein) and RNA features for CD4 and CD8a. In these data the RNA features are identified by Ensembl ID. CD4 is ENSG00000010610 and CD8a is ENSG00000153563.
sel_features <- c("protein_CD4", "rna_ENSG00000010610", 
                  "protein_CD8a", "rna_ENSG00000153563")
  1. Generate feature plots.
FeaturePlot(sc_w_fbc, features = sel_features, label=T)

3. Generate dot plots.

DotPlot(sc_w_fbc, features = sel_features) + RotatedAxis()

4. Generate violin plots.

VlnPlot(sc_w_fbc, features = sel_features, ncol=2)

2.4.3 Computing basic statistics of features (TAKE HOME EXERCISE)

2.4.3.1 Generate cell counts for each cluster with antibody counts above a threshold. (Can use violin plots to determine cutoff)

  1. Compute “presence” of each antibody tag if more than 1 normalized (center log ratio) counts. NOTE: In real datasets, you may need to be more sophisticated in choosing the threshold. You may even need to set different thresholds for different antibody tags.
fbc_presence <- t(sc_w_fbc[["Protein"]]@data > 1)
  1. Get cluster IDs for each cell and merge with the FBC presence table.
cluster_ids <- data.frame(Cluster=sc_w_fbc$seurat_clusters)

fbc_presence <- merge(cluster_ids, fbc_presence, by.x=0, by.y=0)
  1. Compute cell counts for presence/absence for an antibody, e.g. CD4
cd4_counts <- table(fbc_presence[,c("Cluster", "CD4")])

cd4_counts
       CD4
Cluster FALSE TRUE
      0    10  213
      1    55  137
      2     6   87
      3    91    0
      4    80    4
      5    37    1
      6    31    6
      7    31    0
      8     7    9

2.4.3.2 Compile a table of CD4+ cell counts and a fraction of cells in each cluster.

  1. Convert the compute table to a data.frame
cd4_counts_df <- as.data.frame(cd4_counts)
  1. Create a side by side (CD4 presence/absence) data.frame
cd4_counts <- merge(subset(cd4_counts_df, CD4 == "TRUE", select=c(Cluster, Freq)),
                    subset(cd4_counts_df, CD4 == "FALSE", select=c(Cluster, Freq)),
                    by.x=1, by.y=1)
  1. Rename the columns for convenience
colnames(cd4_counts)[2:3] <- c("Present", "Absent")
  1. Compute the fraction of cells that are CD4+
cd4_counts$Fraction <- cd4_counts$Present / ( cd4_counts$Present + cd4_counts$Absent )
cd4_counts
  Cluster Present Absent   Fraction
1       0     213     10 0.95515695
2       1     137     55 0.71354167
3       2      87      6 0.93548387
4       3       0     91 0.00000000
5       4       4     80 0.04761905
6       5       1     37 0.02631579
7       6       6     31 0.16216216
8       7       0     31 0.00000000
9       8       9      7 0.56250000

2.5 Incorporating V(D)J data

Information about the format of the filtered_contig_annotations.csv file can be found at https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation#contig. For this exercise, the following contig annotations were provided.

Column Description
barcode Cell-barcode for this contig.
is_cell True or False value indicating whether the barcode was called as a cell.
contig_id Unique identifier for this contig.
high_confidence True or False value indicating whether the contig was called as high-confidence (unlikely to be a chimeric sequence or some other artifact).
length The contig sequence length in nucleotides.
chain The chain associated with this contig; for example, TRA, TRB, IGK, IGL, or IGH. A value of “Multi” indicates that segments from multiple chains were present.
v_gene The highest-scoring V segment, for example, TRAV1-1.
d_gene The highest-scoring D segment, for example, TRBD1.
j_gene The highest-scoring J segment, for example, TRAJ1-1.
c_gene The highest-scoring C segment, for example, TRAC.
full_length If the contig was declared as full-length.
productive If the contig was declared as productive.
cdr3 The predicted CDR3 amino acid sequence.
cdr3_nt The predicted CDR3 nucleotide sequence.
reads The number of reads aligned to this contig.
umis The number of distinct UMIs aligned to this contig.
raw_clonotype_id The ID of the clonotype to which this cell barcode was assigned.
raw_consensus_id The ID of the consensus sequence to which this contig was assigned.

We will start with a very basic analysis: check which cells have TRA or TRB chains. Depending on the goals of your project, you will also want to dig into the specific CDR3 sequences to compare between cells or between clusters.

2.5.1 Get basic TCR chain statistics for each cluster.

Load the contig annotations into R.

sc_vdj_contigs <- read.csv("https://wd.cri.uic.edu/scrna/tcr_contigs.csv")

Clean up the barcode IDs, so they match the IDs in the Seurat object (missing the trailing numbers).

sc_vdj_contigs$barcode <- sub('-[0-9]+$', '', sc_vdj_contigs$barcode)

Check how many cells have TRA or TRB or both.

# start with all of the cells
all_cells <- sc_w_fbc@meta.data["seurat_clusters"]
all_cells$TRA <- rownames(all_cells) %in% sc_vdj_contigs$barcode[sc_vdj_contigs$chain=="TRA"]
all_cells$TRB <- rownames(all_cells) %in% sc_vdj_contigs$barcode[sc_vdj_contigs$chain=="TRB"]
all_cells$TRAB <- all_cells$TRA & all_cells$TRB
head(all_cells)
                 seurat_clusters   TRA   TRB  TRAB
AAACCTGCAGCCTGTG               6 FALSE FALSE FALSE
AAACGGGTCGCTGATA               2  TRUE  TRUE  TRUE
AAAGCAAAGTATGACA               2  TRUE  TRUE  TRUE
AAATGCCCACTTAAGC               0  TRUE  TRUE  TRUE
AACACGTCAACAACCT               0  TRUE  TRUE  TRUE
AACACGTCAAGCGATG               1 FALSE FALSE FALSE

Summary stats of complete A and B chains per cluster.

table(all_cells[,c("seurat_clusters","TRAB")])
               TRAB
seurat_clusters FALSE TRUE
              0    19  204
              1   191    1
              2     5   88
              3     9   82
              4    84    0
              5     4   34
              6    12   25
              7    31    0
              8    16    0

2.5.2 Visualize TCR chain presence/absence on UMAP plot.

NOTE: This exercise will extract the UMAP plot data from the Seurat object and create a custom plot. It is also possible to add the TCR data to the Seurat object as a secondary assay, akin to the FBC data, and use the Seurat tools.

Get the UMAP coordinates as a data.frame.

umap_coords <- as.data.frame(Embeddings(sc_w_fbc, reduction="umap"))
# Take a peek at the results
head(umap_coords)
                    umap_1    umap_2
AAACCTGCAGCCTGTG -7.553978  2.770488
AAACGGGTCGCTGATA -5.912212 -1.226433
AAAGCAAAGTATGACA -6.879341 -0.258912
AAATGCCCACTTAAGC -6.659776 -5.674999
AACACGTCAACAACCT -4.886473 -5.689088
AACACGTCAAGCGATG  9.418021 -2.538629

Merge the UMAP coordinates and chain information.

plot_data <- merge(all_cells, umap_coords, by=0)
# Take a peek at the results
head(plot_data)
         Row.names seurat_clusters   TRA   TRB  TRAB    umap_1    umap_2
1 AAACCTGCAGCCTGTG               6 FALSE FALSE FALSE -7.553978  2.770488
2 AAACGGGTCGCTGATA               2  TRUE  TRUE  TRUE -5.912212 -1.226433
3 AAAGCAAAGTATGACA               2  TRUE  TRUE  TRUE -6.879341 -0.258912
4 AAATGCCCACTTAAGC               0  TRUE  TRUE  TRUE -6.659776 -5.674999
5 AACACGTCAACAACCT               0  TRUE  TRUE  TRUE -4.886473 -5.689088
6 AACACGTCAAGCGATG               1 FALSE FALSE FALSE  9.418021 -2.538629

Create the basic plot, UMAP plot colored by chain information. The addition of guides(color=guide_legend(override.aes = list(size=3))) will set the size of the plot character in the legend to be a different size than the plot characters used in the plot.

library(ggplot2)
ggplot(plot_data, aes(x=umap_1, y=umap_2, color=TRAB)) +
        geom_point(size=0.5) + theme_classic() +
        guides(color=guide_legend(override.aes = list(size=3)))

ggplot(plot_data, aes(x=umap_1, y=umap_2, color=seurat_clusters)) +
        geom_point(size=0.5) + theme_classic() +
        guides(color=guide_legend(override.aes = list(size=3)))

NOTE: Additional exercise of incorporating V(D)J data into Seurat is found in Extra (Take Home) Exercises.

3 Extra (Take Home) Exercises

The following packages are needed for the take-home exercises:

  • NOTE: scds and SingleCellExperiment are needed to complete the doublet removal exercise we skipped (Exercise 1.3)
  1. From Bioconductor:
BiocManager::install("scds", update=F)
BiocManager::install("SingleCellExperiment", update=F)
BiocManager::install("monocle", update=F)
  1. From CRAN:
install.packages('fossil')
install.packages('cowplot')
install.packages('vegan')
  1. Custom installation of CytoTRACE:
  • Download the CytoTRACE installation source from https://cytotrace.stanford.edu/. For the latest version go to the CytoTRACE website and then click on Install R Package in the sidebar menu.
  • Execute the following commands in R. Modify the PATH/TO/DOWNLOAD to reflect the directory where you download the CytoTRACE installation package. Ensure that devtools is installed (needed for presto before).
if ( ! requireNamespace("devtools"))
   install.packages("devtools")
devtools::install_local("PATH/TO/DOWNLOAD/CytoTRACE_0.3.3.tar.gz")

NOTE: If CytoTRACE could not be installed because ERROR: dependency 'sva' is not available for package 'CytoTRACE', install “sva” using BiocManager.

BiocManager::install("sva", update=F)
  1. Test loading these packages:
library(cowplot)
library(CytoTRACE)
library(fossil)
library(monocle)
library(scds)
library(SingleCellExperiment)
library(vegan)

3.1 Cluster comparisons

3.1.1 First, let’s rerun clustering with two other resolutions

We can use the same object, as each resolution will get saved into its own slot.

sc_subset <- FindClusters(sc_subset, resolution=0.1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 7027
Number of edges: 255224

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9473
Number of communities: 6
Elapsed time: 0 seconds
sc_subset <- FindClusters(sc_subset, resolution=1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 7027
Number of edges: 255224

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8393
Number of communities: 15
Elapsed time: 0 seconds
  • Check how many clusters we got for each one. Remember that higher resolution would mean more clusters
table(sc_subset@meta.data$RNA_snn_res.0.1)

   0    1    2    3    4    5 
3348 2336  475  422  412   34 
table(sc_subset@meta.data$RNA_snn_res.0.5)

   0    1    2    3    4    5    6    7    8    9   10   11 
1259 1086  941  790  580  516  510  421  414  247  228   35 
table(sc_subset@meta.data$RNA_snn_res.1)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
1061  865  804  575  529  486  466  421  411  395  333  247  228  171   35 

3.1.2 Compare the similarity of the clustering results at a high level with the adjusted Rand index

  • The fossil package was written with archaeological data sets in mind, but has a useful implementation of the rand and adjusted rand indices
  1. Load the fossil library
library(fossil)
  1. Make separate vectors from the different clustering results for convenience
clust0.1 <- sc_subset@meta.data$RNA_snn_res.0.1
clust0.5 <- sc_subset@meta.data$RNA_snn_res.0.5
clust1 <- sc_subset@meta.data$RNA_snn_res.1
  1. Compute the adjusted rand index for each. Not surprisingly, resolutions 0.1 and 1 are the least similar
adj.rand.index(clust0.1, clust0.5)
[1] 0.6545382
adj.rand.index(clust0.1, clust1)
[1] 0.6578186
adj.rand.index(clust0.5, clust1)
[1] 0.8855797

NOTE: If you ended up with a large number of clustering results and you wanted to compute pair-wise similarities across all of them using the adjusted rand index, you would approach this by setting up a data frame with all of the clustering results in it across the columns, then running a loop over all pairs of columns.:

# building a data frame for our 3 clustering results
cluster.df <- data.frame(clust0.1, clust0.5, clust1)
# define a function to make a similarity matrix
cluster_similarity <- function( clust_df ){
   # number of columns we're comparing
   columns <- ncol(clust_df)
   # set up a similarity matrix
   rand.sim <- matrix( nrow=columns, ncol=columns )
   rownames(rand.sim) = colnames(clust_df)
   colnames(rand.sim) = colnames(clust_df)
   # set all values to be 1 first
   # this will keep the diagonal entries 1 after we do the other calculations
   rand.sim[,] <- 1
   # loop over all pairs of columns
   for( i in 1:(columns-1) ){
      for( j in (i+1):columns ){
         # compute the similarity and store it at positions i,j and j,i
         sim <- adj.rand.index( clust_df[,i], clust_df[,j] )
         rand.sim[i,j] <- sim
         rand.sim[j,i] <- sim
      }
   }
   return(rand.sim)
}
# run the function
cluster.sim <- cluster_similarity(cluster.df)
cluster.sim

3.1.3 Detailed comparison between clustering at resolutions 0.25 and 1 using the overlap index

The goal is to compare the actual sets of cells in each clustering result, so that we can see which clusters between the results are related to each other. As a first pass, we can evaluate this qualitatively with two UMAP plots.

DimPlot(sc_subset, reduction='umap', group.by="RNA_snn_res.0.1", label=T)

DimPlot(sc_subset, reduction='umap', group.by="RNA_snn_res.1", label=T)

Quantitative comparison:

  • Write a function to compute the overlap index (then we can re-use it in another analysis if wanted).
  • In the function, we need to write a double loop over both sets of clusters. You don’t need to type the comment lines (#), but they are there to explain our steps along the way.

Source this function to save time:

source("https://wd.cri.uic.edu/scrna/overlap_index.R")

Here is its definition, if you wanted to write it out:

# function for overlap index
overlap_index <- function( cluster1, cluster2 ){
   # make factor vectors from clusters
   c1 = as.factor(cluster1)
   c2 = as.factor(cluster2)
   # define a set of "names" for our cells, which we'll compare between clusters
   # the names are arbitraty, so we'll just number them
   names = 1:length(c1)
   # make distance matrix to store our calculations in
   my_dist = matrix(nrow=length(levels(c1)),ncol=length(levels(c2)))
   rownames(my_dist) = levels(c1)
   colnames(my_dist) = levels(c2)
   for(i in 1:length(levels(c1))){
      for(j in 1:length(levels(c2))){
         # get the list of cells in each cluster
         c1.sub = names[c1==levels(c1)[i]]
         c2.sub = names[c2==levels(c2)[j]]
         # get the intersection and min cluster size
         int = length(intersect(c1.sub,c2.sub))
         minsize = min(length(c1.sub),length(c2.sub))
         # overlap index
         overlap = int/minsize
         # store in matrix
         my_dist[i,j] = overlap
      }
   }
   return(my_dist)
}

Now run the function

res_0.1vs1 <- overlap_index(clust0.1, clust1)
res_0.1vs1
             0          1         2           3           4           5 6 7 8
0 0.9990574929 0.01849711 0.9738806 0.003478261 0.992438563 0.997942387 1 0 0
1 0.0009425071 0.98150289 0.0261194 0.996521739 0.007561437 0.000000000 0 0 0
2 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.000000000 0 0 0
3 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.000000000 0 1 0
4 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.002427184 0 0 1
5 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.000000000 0 0 0
           9 10 11 12          13         14
0 0.02531646  0  0  0 0.005847953 0.00000000
1 0.97468354  1  0  0 0.994152047 0.00000000
2 0.00000000  0  1  1 0.000000000 0.00000000
3 0.00000000  0  0  0 0.000000000 0.02857143
4 0.00000000  0  0  0 0.000000000 0.00000000
5 0.00000000  0  0  0 0.000000000 1.00000000
  • Plot values in a heatmap. In the heatmap we can see…
    • Cluster 2 in res=0.1 becomes cluster 5 in res=1
    • Cluster 1 in res=0.1 is split into clusters 4, 3, 1 and 6 in res=1
library(ComplexHeatmap)
Heatmap(res_0.1vs1, 
        col = c("white","yellow","red"), 
        name = "Overlap Index",
        column_title = "Resolution 1", 
        row_title = "Resolution 0.1")

3.2 Additional cell typing exercises

3.2.1 Feature plots for gene markers

We may also find it useful to look at the expression of these markers on the UMAP plot. Here’s how we can use Seurat’s FeaturePlot function, but add in our own custom labels:

library(cowplot)
# run FeaturePlot with combine=F so we get all separate plots
# 'myplots' will be a list of ggplot objects
myplots <- FeaturePlot(sc_subset, markers$ID, combine=F)
# for each plot in the list, add a new title with the description
for(i in 1:length(myplots)){
  myplots[[i]] = myplots[[i]] + ggtitle(markers_description[i])
}
# then make a new combined plot with plot_grid
plot_grid(plotlist=myplots, ncol=3)

3.2.2 Marker-based automated cell typing with scType

scType can be run by simply sourcing a function from their GitHub page.

The database for scType is a named list of marker genes for each cell type, which we will make from our marker list above. You can also provide negative markers in a separate list if desired.

For more information, see the scType documentation: https://sctype.app.

# source the scType functions from github
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# prepare gene sets using the marker list above
celltypes <- unique(markers$Cell.Type)
celltypes_genes <- sapply(celltypes, function(x){
  subset(markers, Cell.Type==x)$ID})
# this is a named list of gene markers for each cell type
celltypes_genes
$`B cells`
[1] "ENSG00000105369" "ENSG00000156738" "ENSG00000119866"

$Monocytes
[1] "ENSG00000142089" "ENSG00000197746" "ENSG00000101439" "ENSG00000090382"

$`NK cells`
[1] "ENSG00000115523" "ENSG00000203747"

$`T cells`
[1] "ENSG00000167286" "ENSG00000198851" "ENSG00000160654" "ENSG00000188404"
# also include CD3 genes as negative markers for NK cells
negative_markers <- list("NK cells"=markers$ID[grep("^CD3",markers$Marker)])
# obtain normalized expression for these genes
norm <- GetAssayData(sc_subset, layer="data")
norm <- as.matrix(norm[markers$ID,])
# run scType
# scaled=F means that we are supplying normalized, but not z-scored data
#   and scType will run z-scoring then
# the "gs2" parameter is for negative markers (optional in general)
sctype_result <- sctype_score( norm, scaled=F,
  gs=celltypes_genes, gs2=negative_markers )
# transpose and peek at results
sctype_result <- t(sctype_result)
head(sctype_result)
                         B cells  Monocytes  NK cells    T cells
s1_AAACGGGAGACTGTAA-1  6.5444141 -0.2445015  2.310882 -1.9307517
s1_AAACGGGTCTCAAACG-1 -0.4459658 -0.8862130 -2.178354  1.6114778
s1_AAAGTAGGTAGCAAAT-1 -0.4459658 -0.8862130 -1.481604  1.3904256
s1_AAAGTAGTCCGCAAGC-1 -0.4459658  1.0578332 -2.198394  0.9068864
s1_AAATGCCTCTTGAGGT-1 -0.4459658  0.2940607 -1.348341  0.1707191
s1_AACACGTCAACTGCTA-1 -0.4459658 -0.3488254 -1.844020  1.3543735
# assign a cell type per cell based on the max score
celltypes <- apply(sctype_result, 1, function(x){
  colnames(sctype_result)[which.max(x)]})
# count the number of each cell type per cluster
cluster_types <- data.frame(
  Cluster = sc_subset@meta.data$RNA_snn_res.0.5,
  Type = celltypes)
cluster_summary <- table(cluster_types)
cluster_summary
       Type
Cluster B cells Monocytes NK cells T cells
     0       46        95      272     846
     1       55        99      246     686
     2       42        84      275     540
     3       34       104      128     524
     4       17        62       84     417
     5       24        41       59     392
     6       32        46       81     351
     7        2       417        2       0
     8        0         1      411       2
     9      237         0       10       0
     10     221         0        7       0
     11       0        23       12       0
# obtain the majority cell type per cluster
cluster_type <- apply(cluster_summary,1,function(x){
  colnames(cluster_summary)[which.max(x)]})
cluster_type
          0           1           2           3           4           5 
  "T cells"   "T cells"   "T cells"   "T cells"   "T cells"   "T cells" 
          6           7           8           9          10          11 
  "T cells" "Monocytes"  "NK cells"   "B cells"   "B cells" "Monocytes" 
# assign these cell types in the Seurat object
sc_subset@meta.data$CellType <- cluster_type[sc_subset@meta.data$RNA_snn_res.0.5]
DimPlot(sc_subset, group.by = "CellType")

# compare to the clusters
DimPlot(sc_subset, reduction='umap', label=T)

Compare this to what you would have assigned based on the marker genes in Exercise 2.2, as well as to the label transfer results in that exercise.

3.3 Additional cluster-to-cluster differential tests and visualizations

3.3.1 Visualizations for DEGs

Make nicer versions of these plots, with a little extra work.

Violin plots:

library(cowplot)
myplots <- VlnPlot(sc_subset, features=top5_cluster5$gene, combine=F)
for(i in 1:length(myplots)){
  myplots[[i]] = myplots[[i]] + ggtitle(top5_cluster5$symbol[i])
}
plot_grid(plotlist=myplots)

Dotplots:

DotPlot(sc_subset, features=top5_cluster5$gene ) +
  scale_x_discrete(labels=function(x) genenames[x,1])

Heatmap:

Option 1, using Seurat’s DoHeatmap (this will still omit genes that were not z-scaled - not part of the variable features).

DoHeatmap(sc_subset, features=top5$gene) +
  scale_y_discrete(labels=function(x) genenames[x,1])

Option 2, using ComplexHeatmap

# get normalized expression data of interest
norm.data <- GetAssayData(sc_subset,layer="data")
norm.data <- norm.data[top5$gene,]
# z-score
norm.data <- t(scale(t(norm.data)))
# order the columns by cluster
columns.clusters <- sc_subset@meta.data$seurat_clusters
columns.order <- order(columns.clusters)
norm.data <- norm.data[,columns.order]
columns.clusters <- columns.clusters[columns.order]

# make heatmap
library(ComplexHeatmap)
library(circlize)
col_fun <- colorRamp2(breaks=c(-2,0,2),colors=c("blue","white","red"))
column_annotation <- HeatmapAnnotation(cluster=columns.clusters)

Heatmap(norm.data, name="Z-score", col=col_fun,
  cluster_columns=F, show_column_names=F, column_split = columns.clusters,
  cluster_rows=F, row_labels=top5$symbol)

3.3.2 Other ways to run differential expression

  • You can use the function FindMarkers (instead of FindAllMarkers) to test specific subsets of clusters.
  • With the FindMarkers function, you can use the group.by parameter to test based on, e.g., orig.ident.
  1. If you wanted to test between a specific pair of clusters.
cluster1vs2 <- FindMarkers(sc_subset, ident.1=1, ident.2=2, group.by="seurat_clusters")
  1. If you wanted to test between one cluster vs the combination of two other clusters.
cluster1vs2and3 <- FindMarkers(sc_subset, ident.1=1, ident.2=c(2,3),
  group.by="seurat_clusters")
  1. If you wanted to test between predicted cell types instead, first change the cell identities (Idents).
# check we've assigned cell types from the label transfer analysis
sc_subset@meta.data$CellType <- prediction_type[sc_subset@meta.data$RNA_snn_res.0.5]
Idents(sc_subset) <- "CellType"
stats_between_types <- FindAllMarkers(sc_subset, test.use="wilcox")
Calculating cluster Bcells
Calculating cluster CD8Tcells
Calculating cluster CD4Tcells
Calculating cluster Monocytes
Calculating cluster NKcells

3.4 Differential stats as pseudo-bulk

  • Subset the cells to a cluster of interest.
  • Then sum the counts per sample, which will yield a counts table.
  • Run analysis as usual in edgeR.

NOTE: use a loop in R to perform this analysis over all clusters.

library(tidyr)
sc_cluster1 <- subset(sc_subset, subset = seurat_clusters == 1)
samples <- unique(sc_cluster1$orig.ident)
counts <- GetAssayData(sc_cluster1,layer="counts")
cluster1_counts <- sapply(samples, function(x){
  # boolean vector indicating which cells are in sample x
  sample.cells <- sc_cluster1$orig.ident == x
  # sum counts over all of these cells
  sample.counts <- Matrix::rowSums(counts[,sample.cells])
  return(sample.counts)
})
# this is the pseudo-bulk counts table for this cluster
head(cluster1_counts)
                V035_F031 V039_F093 V040_F198 V041_F043 V043_F047 V044_F037
ENSG00000243485         0         0         0         0         0         0
ENSG00000237613         0         0         0         0         0         0
ENSG00000186092         0         0         0         0         0         0
ENSG00000238009         0         0         1         0         0         0
ENSG00000239945         0         0         0         0         0         0
ENSG00000237683         0         0         0         0         1         0
                V045_F120 V054_F209
ENSG00000243485         0         0
ENSG00000237613         0         0
ENSG00000186092         0         0
ENSG00000238009         0         0
ENSG00000239945         0         0
ENSG00000237683         0         1
# now run differential expression statistics in edgeR
library(edgeR)
# obtain metadata and match column order
metadata <- read.delim("https://wd.cri.uic.edu/scrna/GSE138867_metadata.txt",row.names=1)
metadata <- metadata[colnames(cluster1_counts),]
# subset genes to those expressed
cluster1_counts_use = cluster1_counts[rowSums(cluster1_counts)>20,]
# prepare edgeR object
cluster1_dge <- DGEList(cluster1_counts_use, group=metadata)
# normalization and dispersion
cluster1_dge <- calcNormFactors(cluster1_dge)
cluster1_dge <- estimateDisp(cluster1_dge)
Using classic mode.
# check the BCV level
sqrt(cluster1_dge$common.dispersion)
[1] 0.2676988
# run differential stats
cluster1_result <- exactTest(cluster1_dge)$table
cluster1_result$QValue <- p.adjust(cluster1_result$PValue)
head(cluster1_result)
                      logFC   logCPM    PValue QValue
ENSG00000228463 -0.06724653 3.454524 1.0000000      1
ENSG00000230368  0.44547704 3.416732 1.0000000      1
ENSG00000188976  0.26854915 5.370629 0.7806707      1
ENSG00000187608  0.37810777 6.713287 0.3632735      1
ENSG00000186891  0.02712532 4.657649 0.4791354      1
ENSG00000186827  0.06737773 5.734380 0.6599663      1
# count how many are significant
table(cluster1_result$QValue < 0.05)

FALSE  TRUE 
 7925     5 
# comparison of DEGs here and from single-cell comparison
colnames(cluster1_result) <- paste0("Pseudobulk_",colnames(cluster1_result))
deg_comparison <- merge(cluster1_result, cluster1_stats, by=0)
subset(deg_comparison, Pseudobulk_QValue < 0.05)
           Row.names Pseudobulk_logFC Pseudobulk_logCPM Pseudobulk_PValue
6829 ENSG00000188536         4.025540          7.579265      3.331712e-14
7308 ENSG00000206172         4.356953          6.600759      2.683781e-21
7565 ENSG00000236016         3.098049          5.191447      5.921193e-09
7652 ENSG00000244734         5.194182          8.778717      3.886196e-31
7733 ENSG00000255823         2.926671          5.027914      5.819543e-08
     Pseudobulk_QValue        p_val avg_log2FC pct.1 pct.2    p_val_adj
6829      2.641382e-10 5.914654e-16   4.155458 0.229 0.049 4.209966e-13
7308      2.127970e-17 3.328512e-03   3.770804 0.069 0.029 1.355351e-01
7565      4.693730e-05 3.224137e-11   2.957707 0.139 0.022 1.235712e-08
7652      3.081754e-27 2.882615e-20   4.821283 0.251 0.040 5.745051e-17
7733      4.612570e-04 2.997342e-16   3.260734 0.171 0.016 2.489043e-13
         symbol
6829       HBA2
7308       HBA1
7565 RP1-3J17.3
7652        HBB
7733   MTRNR2L8

3.5 Pseudotime

3.5.1 Load the libraries and data (if necessary)

  1. Load the Seurat (needed to manipulate the previously create Seurat object) and monocle for the pseudotime analysis
# load both libraries, if not done already
library(Seurat)
library(monocle)
  1. Load the previously analyzed Seurat object, if needed.
sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/clustered.rds"))

3.5.2 Get Monocle object from Seurat

Monocle has a function called “importCDS” that is supposed to import from Seurat. But it doesn’t work. Fun. We have created a fixed version of the function, which you can source from public-data.

Pro tip: If you want to see the code of a function, just type its name and hit enter. E.g., type “importCDS” to see the code for this function in Monocle2. After you’ve sourced our R script, you can type “importCDS2” to see our version.

# source the R script
source("https://wd.cri.uic.edu/scrna/importCDS2.R")
# use the importCDS2 function (our function) instead of importCDS (monocle function)
monocle_data <- importCDS2(sc_subset)
Warning in .sparse2dense(e1): sparse->dense coercion: allocating vector of size
1.7 GiB
Warning in .sparse2dense(e2): sparse->dense coercion: allocating vector of size
1.7 GiB
monocle_data
CellDataSet (storageMode: environment)
assayData: 32738 features, 7027 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: s1_AAACGGGAGACTGTAA-1 s1_AAACGGGTCTCAAACG-1 ...
    s8_TTTGTCAGTGACGGTA-1 (7027 total)
  varLabels: orig.ident nCount_RNA ... Size_Factor (10 total)
  varMetadata: labelDescription
featureData
  featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000215611
    (32738 total)
  fvarLabels: gene_short_name
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:  

3.5.3 Start processing in Monocle

We need to tell Monocle which genes to base the analysis on. Monocle has a set of steps to do this, similar to how we picked the variable genes before. But we can just use the variable genes we’ve already selected.

  1. Estimate size factors and pick which genes to use
monocle_data <- estimateSizeFactors(monocle_data)
# get variable genes from Seurat object
top.genes <- VariableFeatures(sc_subset)
monocle_data <- setOrderingFilter(monocle_data, top.genes)
  1. Reduce dimensions - this step may take several minutes.
monocle_data <- reduceDimension(monocle_data, max_components=2, method='DDRTree')
  1. Infer the cell ordering (pseudotime). This command may take a minute or two to run.
monocle_data <- orderCells(monocle_data)

3.5.4 Try different visualizations

  • Default coloring is by “state”, the segment of the tree we’re in this tree doesn’t have any branches, so there is only 1 state
plot_cell_trajectory(monocle_data)

  • Can also color by the pseudotime
plot_cell_trajectory(monocle_data, color_by = "Pseudotime")

  • Color by seurat cluster
plot_cell_trajectory(monocle_data, color_by = "RNA_snn_res.0.5")

3.5.5 Further exploration of the data.

The pData function from the monocle package will return a data.frame with the computed psuedotime values. As the original Seurat data were also imported, e.g. clusters, original idents.

head(pData(monocle_data))
                      orig.ident nCount_RNA nFeature_RNA  group percent.MT
s1_AAACGGGAGACTGTAA-1  V035_F031       5201         1237 Smoker   3.653144
s1_AAACGGGTCTCAAACG-1  V035_F031       3909         1169 Smoker   3.223331
s1_AAAGTAGGTAGCAAAT-1  V035_F031       5026         1335 Smoker   1.929964
s1_AAAGTAGTCCGCAAGC-1  V035_F031       4087         1259 Smoker   4.061659
s1_AAATGCCTCTTGAGGT-1  V035_F031       5521         1292 Smoker   1.829379
s1_AACACGTCAACTGCTA-1  V035_F031       3614          804 Smoker   2.102933
                      IsDoublet RNA_snn_res.0.5 seurat_clusters Size_Factor
s1_AAACGGGAGACTGTAA-1     FALSE              10              10   1.3580499
s1_AAACGGGTCTCAAACG-1     FALSE               2               2   1.0206916
s1_AAAGTAGGTAGCAAAT-1     FALSE               0               0   1.3123551
s1_AAAGTAGTCCGCAAGC-1     FALSE               3               3   1.0671698
s1_AAATGCCTCTTGAGGT-1     FALSE               3               3   1.4416061
s1_AACACGTCAACTGCTA-1     FALSE               0               0   0.9436632
                      Pseudotime State
s1_AAACGGGAGACTGTAA-1   9.067058     1
s1_AAACGGGTCTCAAACG-1  18.841316     2
s1_AAAGTAGGTAGCAAAT-1  15.121057     3
s1_AAAGTAGTCCGCAAGC-1  19.718525     2
s1_AAATGCCTCTTGAGGT-1  15.586364     2
s1_AACACGTCAACTGCTA-1  19.458620     3
  • Can plot time vs gene expression.
# Get the top marker gene for cluster 7
topgene = degs$gene[degs$cluster==7][1]
# plot using gene expression as size of dot
plot_cell_trajectory(monocle_data, markers = topgene, use_color_gradient=T)

  • One can extract plot coordinates from trajectory plot. Note that monocle_data@reducedDimW has “whitened” cell coordinates in the reduced dimension space. Analogous to the loadings from PCA
head(t(monocle_data@reducedDimS))
                            [,1]       [,2]
s1_AAACGGGAGACTGTAA-1  5.8585574  2.6289045
s1_AAACGGGTCTCAAACG-1 -0.6354290 -2.8018766
s1_AAAGTAGGTAGCAAAT-1 -0.0216304  1.8470248
s1_AAAGTAGTCCGCAAGC-1 -0.4642292 -3.6755653
s1_AAATGCCTCTTGAGGT-1 -0.1627147  0.5634199
s1_AACACGTCAACTGCTA-1 -3.9413785  3.6908672

3.6 Statistics with Pseudotime

3.6.1 Look for genes that are highly correlated with pseudotime

  • We’ll test just a subset of genes them to save computational time.
  • We’re comparing two continuous variables, so a correlation or regression analysis is appropriate.

First we can use the built-in regression test from Monocle, which smooths the data using a spline fit.

  1. Get a list of genes to test. Usually you might test all the genes, or just the variable features. This example will test just a subset of the variable features.
to_test <- top.genes[1:20]
  1. Subset the monocole data to only include the specified genes
monocle_subset <- monocle_data[to_test,]
  1. Perform differential gene test. sm.ns fits a spline fit through the expression values to smooth them a bit. differentialGeneTest is then running a regression analysis against the smoothed expression.
pseudotime_test <- differentialGeneTest(monocle_subset, 
   fullModelFormulaStr="~sm.ns(Pseudotime)")
head(pseudotime_test)
                status           family pval qval gene_short_name
ENSG00000163220     OK negbinomial.size    0    0 ENSG00000163220
ENSG00000143546     OK negbinomial.size    0    0 ENSG00000143546
ENSG00000090382     OK negbinomial.size    0    0 ENSG00000090382
ENSG00000169429     OK negbinomial.size    0    0 ENSG00000169429
ENSG00000101439     OK negbinomial.size    0    0 ENSG00000101439
ENSG00000204287     OK negbinomial.size    0    0 ENSG00000204287
                use_for_ordering
ENSG00000163220             TRUE
ENSG00000143546             TRUE
ENSG00000090382             TRUE
ENSG00000169429             TRUE
ENSG00000101439             TRUE
ENSG00000204287             TRUE

Alternatively, we can do the test with a Spearman correlation.

pseudotime <- pData(monocle_data)$Pseudotime
# for time purposes, we'll do the first 20 genes again
expression <- GetAssayData(sc_subset,layer="data")[to_test,]
# run test with apply statement, making our own short function to return the 
# correlation coefficient and p-value from cor.test
# we're going to transpose the result because otherwise the genes are going
# along the columns, and keep it as a data frame
spearman.corrs <- data.frame(t(apply(expression,1,function(x){
   r=cor.test(x,pseudotime,method="spearman"); return(c(r$estimate, r$p.value))})))
colnames(spearman.corrs) = c("rho","p.value")
# add FDR correction
q.value <- p.adjust(spearman.corrs$p.value,method="fdr")
spearman.corrs <- cbind(spearman.corrs,q.value)
# sort by p-value
spearman.corrs <- spearman.corrs[order(spearman.corrs$p.value),]
# see what the final table looks like
head(spearman.corrs)
                       rho       p.value       q.value
ENSG00000204287 -0.3862747 7.867704e-249 1.573541e-247
ENSG00000163221 -0.3309131 3.480297e-179 3.452046e-178
ENSG00000085265 -0.3307610 5.178069e-179 3.452046e-178
ENSG00000257764 -0.3247887 2.590923e-172 1.295462e-171
ENSG00000115523  0.3133590 6.562340e-160 2.624936e-159
ENSG00000101439 -0.2724192 8.204193e-120 2.734731e-119

We can also plot gene expression relative to pseudotime

library(ggplot2)
# plot the top gene, adding in a LOESS curve fit to non-zero values
top_gene <- rownames(spearman.corrs)[1]
plot_data <- data.frame(Pseudotime=pseudotime, Gene=c(t(expression[top_gene,])))
ggplot(plot_data, aes(x=Pseudotime, y=Gene)) + geom_point() +
  labs(x="Pseudotime",y="Scaled Expression", title=top_gene) +
  geom_smooth(method="loess")
`geom_smooth()` using formula = 'y ~ x'

3.6.2 Compare our cell clusters with pseudotime

  • We’re comparing categorical and continuous variables, so a Kruskal-Wallis/ANOVA type test is appropriate.
clusters <- sc_subset@meta.data$RNA_snn_res.0.5
pseudotime <- pData(monocle_data)$Pseudotime
cluster_vs_time <- data.frame(clusters, pseudotime)
ggplot(cluster_vs_time, aes(x=clusters, y=pseudotime)) +
  geom_boxplot()

There is clearly a strong association of cluster vs time. Further tests may be helpful if you want to test the significance of which clusters come earlier or later with respect to pseudotime. For now, we can do an overall test using Kruskal Wallis.

kruskal.test(pseudotime ~ clusters)

    Kruskal-Wallis rank sum test

data:  pseudotime by clusters
Kruskal-Wallis chi-squared = 4920.9, df = 11, p-value < 2.2e-16

3.7 CytoTRACE

CytoTRACE (Cellular (Cyto) Trajectory Reconstruction Analysis using gene Counts and Expression) is a computational method that predicts the differentiation state of cells from single-cell RNA-sequencing data. More details on this package can be found at https://cytotrace.stanford.edu/

3.7.1 Installing CytoTRACE

  1. Install “sva” using BiocManager (dependency for CytoTRACE)
if (! requireNamespace("BiocManager"))
  install.packages("BiocManager")

BiocManager::install("sva", update=F)
  1. Download CytoTRACE installation source from https://cytotrace.stanford.edu/. NOTE: For the latest version go to the CytoTRACE website and then click on Install R Package in the sidebar menu. In the installation instructions there will be a link to the current version of CytoTRACE. At the time of writing this exercise, the current version is 0.3.3.

If you are using a Linux or macOS system, you can download the installation package using the following command.

wget https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz
  1. Execute the following commands in R. Modify the PATH/TO/DOWNLOAD to reflect the directory where you download the CytoTRACE installation package.
if ( ! requireNamespace("devtools"))
  install.packages("devtools")

devtools::install_local("PATH/TO/DOWNLOAD/CytoTRACE_0.3.3.tar.gz")
  1. (OPTIONAL) If you plan to use CytoTRACE to analyze across multiple batches/datasets you will need to install the Python packages scanoramaCT and numpy. If you are using a Linux or macOS system, you can execute the following commands to install these packages
pip install scanoramaCT
pip install numpy

3.7.2 Running CytoTRACE

  1. Load the Seurat (if not already loaded) and CytoTRACE libraries
library(Seurat)
library(CytoTRACE,verbose=F)
Welcome to the CytoTRACE R package, a tool for the unbiased prediction of differentiation states in scRNA-seq data. For more information about this method, visit https://cytotrace.stanford.edu.
  1. Load the desired Seurat object If it is NOT currently loaded.
sc_obj <- readRDS("clustered.rds")
  1. (OPTIONAL) You can subset your Seurat object to clusters of interest, for example, clusters 1 and 2.
sc_obj.1_2 <- subset(sc_obj, subset = RNA_snn_res.0.5 == 1 | RNA_snn_res.0.5 == 2)
  1. Get counts matrix of your Seurat object. Either the whole object or the subset from the previous step. In this example, we are just going to use the subset.
sc_counts <- GetAssayData(sc_obj,layer="counts")
  1. Run CytoTRACE analysis (check methods for what enableFast does if you want to enable it)
cyto_results <- CytoTRACE(as.matrix(sc_counts), enableFast=F)
The number of cells in your dataset exceeds 3,000. CytoTRACE will now be run
in fast mode (see documentation). You can multi-thread this run using the
'ncores' flag. To disable fast mode, please indicate 'enableFast = FALSE'.
CytoTRACE will be run on 1 sub-sample(s) of approximately 7027 cells each using 1 / 1 core(s)
Pre-processing data and generating similarity matrix...
Calculating gene counts signature...
Smoothing values with NNLS regression and diffusion...
Calculating genes associated with CytoTRACE...
Done
Warning message:
In CytoTRACE(sc_counts, enableFast = F) :
  12950 genes have zero expression in the matrix and were filtered

3.7.3 Analyzing the CytoTRACE results

For this exercise, we will add the results from CytoTRACE back into the Seurat object. This will allow us to use a number of the existing Seurat functions to analyze and visualize the CytoTRACE results.

sc_obj@meta.data$CytoTRACE <- cyto_results$CytoTRACE

Visualize the CytoTRACE time variable on a UMAP plot

FeaturePlot(sc_obj, reduction='umap', features='CytoTRACE')

Differential expression with respect to CytoTRACE time

  1. Remove genes with < 25% of cells expressing
# Get the count of cells expressing the gene.   
# The code sum(x>0) is a shortcut to get a count of items greater than zero (0)
counts_sum = apply(sc_counts, 1, function(x) sum(x>0) )
Warning in asMethod(object): sparse->dense coercion: allocating vector of size
1.7 GiB
# Get the genes in which the count is greater than 25% of the number of cells.
genes_keep = counts_sum > ( 0.25 * ncol(sc_counts) )
# Subset the gene counts data to just those genes.
genes_data = GetAssayData(sc_obj,layer="data")[genes_keep,]
# Take a peek at the number of genes (rows) in the filtered data.
dim(genes_data)
[1] 1069 7027
  1. Run a correlation analysis on the normalized expression levels and CytoTRACE time. For this example, we will use the Kendall’s Tau test as it can better handle if there are ties in the data.

NOTE We will get warnings with the Spearman test about ties.

# Use an apply function to run the Spearman test (correlation) for each row.
# We transpose the results as the apply function will combine the output of each 
# iteration column wise and we want to have it by row and save as a data.frame
cytotrace_corr <- data.frame(t(apply(genes_data, 1, function (x) {
  res <- cor.test(x, sc_obj@meta.data$CytoTRACE, method="spearman")
  return(c(res$estimate, res$p.value))
})))

# Set the column names
colnames(cytotrace_corr) <- c("estimate", "PValue")

# Add FDR corrected PValue
cytotrace_corr$QValue <- p.adjust(cytotrace_corr$PValue, method="BH")

# Sort by absolute value of the correlation estimate
cytotrace_corr <- cytotrace_corr[order(abs(cytotrace_corr$estimate), decreasing = T), ]

# Peek at the first few results
head(cytotrace_corr)
                  estimate PValue QValue
ENSG00000075624  0.6781119      0      0
ENSG00000177954 -0.5962351      0      0
ENSG00000122026 -0.5909023      0      0
ENSG00000144713 -0.5856175      0      0
ENSG00000145425 -0.5789766      0      0
ENSG00000196154  0.5684408      0      0

Differential time with respect to cluster or sample

  1. Compile the basic CytoTRACE time stats for each cluster and sample. In this example, we will compute the mean and standard deviation of the CytoTRACE time for each cluster in each sample.
library(dplyr)
sc_obj@meta.data %>% 
  group_by(RNA_snn_res.0.5, group) %>%
  summarize(mean=mean(CytoTRACE), sd=sd(CytoTRACE))
`summarise()` has grouped output by 'RNA_snn_res.0.5'. You can override using
the `.groups` argument.
# A tibble: 24 × 4
# Groups:   RNA_snn_res.0.5 [12]
   RNA_snn_res.0.5 group       mean    sd
   <fct>           <chr>      <dbl> <dbl>
 1 0               Non-Smoker 0.181 0.152
 2 0               Smoker     0.279 0.186
 3 1               Non-Smoker 0.464 0.199
 4 1               Smoker     0.505 0.206
 5 2               Non-Smoker 0.479 0.213
 6 2               Smoker     0.696 0.210
 7 3               Non-Smoker 0.518 0.258
 8 3               Smoker     0.580 0.240
 9 4               Non-Smoker 0.376 0.194
10 4               Smoker     0.380 0.220
# ℹ 14 more rows
  1. Generate a box plot of the CytoTRACE times for each cluster and sample
library(ggplot2)
ggplot(sc_obj@meta.data, aes(x=orig.ident, y= CytoTRACE, fill=group)) + 
  theme_classic() + 
  geom_boxplot() + 
  facet_wrap(~ RNA_snn_res.0.5) +
  labs(x="", y="CytoTRACE time", fill="Sample") +
  theme(axis.text.x=element_text(angle=90))

  1. Perform statistical test(s) of CytoTRACE time as compared with cluster or sample.
    1. Kruskall-Wallis test of CytoTRACE time vs. cluster
    kruskal.test(CytoTRACE ~ RNA_snn_res.0.5, data = sc_obj@meta.data)
    
        Kruskal-Wallis rank sum test
    
    data:  CytoTRACE by RNA_snn_res.0.5
    Kruskal-Wallis chi-squared = 3566.6, df = 11, p-value < 2.2e-16
    1. Compute differences between the smoker/non-smoker groups for each cluster
    cluster_stats <- sc_obj@meta.data %>% group_by(RNA_snn_res.0.5) %>%
      do(w=wilcox.test(CytoTRACE ~ group, data=.)) %>%
      summarize(ClusterID=RNA_snn_res.0.5, PValue=w$p.value)
    head(cluster_stats)
    # A tibble: 6 × 2
      ClusterID   PValue
      <fct>        <dbl>
    1 0         2.35e-19
    2 1         1.31e- 3
    3 2         3.85e-46
    4 3         5.55e- 4
    5 4         9.01e- 1
    6 5         1.95e- 1

3.8 Incorporate TCR data into Seurat object

3.8.1 Convert and load the TCR data into the Seurat object

NOTE: The first five (5) steps are the same as in exercise 2.5.1.

  1. Load the contig annotations into R.
sc_vdj_contigs <- read.csv("https://wd.cri.uic.edu/scrna/tcr_contigs.csv")
  1. Clean up the barcode IDs. Seurat will strip the trailing numbers from the cell barcodes, if present.
sc_vdj_contigs$barcode <- sub('-[0-9]+$', '', sc_vdj_contigs$barcode)
  1. Compute the chain counts for each cell.
# First get a list of the detected chains
table(sc_vdj_contigs$chain)

Multi   TRA   TRB 
    6   578   614 
# Then compute the chain counts per cell
vdj_chain_counts <- as.data.frame(table(sc_vdj_contigs[, c("barcode", "chain")]))
  1. Add presence/absence for the chains.
vdj_chain_counts$presence <- vdj_chain_counts$Freq > 0

# Take a peek at the results
head(vdj_chain_counts)
           barcode chain Freq presence
1 AAACGGGTCGCTGATA Multi    0    FALSE
2 AAAGCAAAGTATGACA Multi    0    FALSE
3 AAATGCCCACTTAAGC Multi    0    FALSE
4 AACACGTCAACAACCT Multi    0    FALSE
5 AACACGTGTTATCCGA Multi    0    FALSE
6 AACACGTGTTATCGGT Multi    0    FALSE
  1. Get the cluster IDs for each cell as a data.frame.
cluster_ids <- data.frame(cluster=Idents(sc_w_fbc))

# Take a peek at the results
head(cluster_ids)
                 cluster
AAACCTGCAGCCTGTG       6
AAACGGGTCGCTGATA       2
AAAGCAAAGTATGACA       2
AAATGCCCACTTAAGC       0
AACACGTCAACAACCT       0
AACACGTCAAGCGATG       1
  1. Convert the chain counts into “wide” format.
library(tidyr)
vdj_cell_chains <- pivot_wider(vdj_chain_counts[, c("barcode", "chain", "Freq")],
  names_from="chain", values_from="Freq")

dim(vdj_cell_chains)
[1] 482   4
head(vdj_cell_chains)
# A tibble: 6 × 4
  barcode          Multi   TRA   TRB
  <fct>            <int> <int> <int>
1 AAACGGGTCGCTGATA     0     1     1
2 AAAGCAAAGTATGACA     0     1     1
3 AAATGCCCACTTAAGC     0     1     1
4 AACACGTCAACAACCT     0     2     3
5 AACACGTGTTATCCGA     0     2     1
6 AACACGTGTTATCGGT     0     1     1
  1. Merge with the cell cluster IDs. This is primarily to add rows for all the cells.
vdj_cell_chains <- merge(cluster_ids, vdj_cell_chains, 
                         by.x=0, by.y=1, all.x=T)

dim(vdj_cell_chains)
[1] 805   5
  1. Set the row names as the cell IDs (barcodes).
row.names(vdj_cell_chains) <- vdj_cell_chains[,1]
  1. Drop the first two columns, i.e. barcode and cluster ID, and convert to matrix
vdj_cell_chains <- as.matrix(vdj_cell_chains[, c(-1, -2)])

head(vdj_cell_chains)
                 Multi TRA TRB
AAACCTGCAGCCTGTG    NA  NA  NA
AAACGGGTCGCTGATA     0   1   1
AAAGCAAAGTATGACA     0   1   1
AAATGCCCACTTAAGC     0   1   1
AACACGTCAACAACCT     0   2   3
AACACGTCAAGCGATG    NA  NA  NA
  1. Set any NA values to be zero (no chains)
vdj_cell_chains[is.na(vdj_cell_chains)] <- 0
  1. Add the data to the Seurat object.

NOTE: the CreateAssayObject expects a counts table with features (chains) on the rows and cell on the columns. So, we will need to transpose the matrix when we create the new assay object.

sc_w_fbc[["TCR"]] <- CreateAssayObject(counts=t(vdj_cell_chains))
  1. (OPTIONAL) Save the modified Seurat object to an .rds file for repeated use.
saveRDS(sc_w_fbc, file="seurat_obj_w_tcr.rds")

3.8.2 Create TCR feature plot in Seurat.

  1. Select the TCR assay as the default assay for plotting.
DefaultAssay(sc_w_fbc) <- "TCR"
  1. Create a UMAP feature plot using the TRA and TRB chain counts.
FeaturePlot(sc_w_fbc, features = c("TRA", "TRB"), label=T)

  1. Create a Seurat dot plot using the TRA and TRB chain counts.
DotPlot(sc_w_fbc, features=c("TRA", "TRB"))

  1. Can also make plot using data from multiple assays. Just add a prefix of the assay name in lowercase.
DotPlot(sc_w_fbc, features=c("tcr_TRA", "tcr_TRB", "protein_CD4", "rna_ENSG00000010610")) +
  RotatedAxis()

3.9 Diversity analysis of V(D)J data

This example will use the library vegan to compute diversity of V(D)J+C annotations as compared with different biological samples.

  1. Load the contig annotations into R.
sc_vdj_contigs <- read.delim("https://wd.cri.uic.edu/scrna/filtered_contig_annotations.txt")
  1. Clean up the barcode IDs. Seurat will strip the trailing numbers from the cell barcodes, if present.
sc_vdj_contigs$barcode <- sub('-[0-9]+$', '', sc_vdj_contigs$barcode)
  1. Get the cluster IDs and original sample IDs for each cell as a data.frame. In this case we are providing these values to you, but in practice you could obtain them from the Seurat object’s metadata.
cluster_ids <- read.delim("https://wd.cri.uic.edu/scrna/cell_info.txt", row.names=1)
head(cluster_ids)
                            cluster     sample
Sample_001_AAACCTGAGCTGTCTA      14 Sample_001
Sample_001_AAACCTGCAAAGGAAG      29 Sample_001
Sample_001_AAACCTGGTGAGGGTT       9 Sample_001
Sample_001_AAACGGGGTGAACCTT      12 Sample_001
Sample_001_AAACGGGGTTGATTCG       8 Sample_001
Sample_001_AAACGGGTCCTATTCA      13 Sample_001
  1. Get the selected V(D)J data from the loaded contig data. The following are number of options to extract V(D)J data.
    • Get all CDR3 amino acid sequences
vdj_data <- subset(sc_vdj_contigs, select=c(barcode, cdr3))
head(vdj_data)
                      barcode            cdr3
1 Sample_001_AAACCTGAGCTGTCTA CASSDLGAGTGQLYF
2 Sample_001_AAACCTGAGCTGTCTA CAVRDPPGNTRKLIF
3 Sample_001_AAACCTGGTGAGGGTT  CASRSGTGDYEQYF
4 Sample_001_AACGTTGAGTAGCCGA    CALSESGGKLTL
5 Sample_001_AACGTTGAGTAGCCGA CASSLKTGGYAEQFF
6 Sample_001_AACTTTCAGGGAAACA CASSLKTGGYAEQFF
  • Get all CDR3 amino acid sequences for TRA chains
vdj_data <- subset(sc_vdj_contigs, chain=="TRA", select=c(barcode, cdr3))
head(vdj_data)
                       barcode            cdr3
2  Sample_001_AAACCTGAGCTGTCTA CAVRDPPGNTRKLIF
4  Sample_001_AACGTTGAGTAGCCGA    CALSESGGKLTL
7  Sample_001_AACTTTCAGGGAAACA    CALSESGGKLTL
9  Sample_001_AACTTTCAGTTACCCA CALSDLDYSNNRLTL
11 Sample_001_AACTTTCCATTCTTAC CAASMRGSALGRLHF
14 Sample_001_AAGGAGCGTAAACGCG   CAVRASSGQKLVF
  • Get combined V(D)J+C annotations.
vdj_data <- data.frame(barcode=sc_vdj_contigs$barcode, 
        vdj_annotation=paste(sc_vdj_contigs$v_gene, sc_vdj_contigs$d_gene, 
            sc_vdj_contigs$j_gene, sc_vdj_contigs$c_gene, sep="."))
head(vdj_data)
                      barcode             vdj_annotation
1 Sample_001_AAACCTGAGCTGTCTA    TRBV13-1..TRBJ2-2.TRBC2
2 Sample_001_AAACCTGAGCTGTCTA         TRAV1..TRAJ37.TRAC
3 Sample_001_AAACCTGGTGAGGGTT TRBV19.TRBD1.TRBJ2-7.TRBC2
4 Sample_001_AACGTTGAGTAGCCGA      TRAV6N-7..TRAJ44.TRAC
5 Sample_001_AACGTTGAGTAGCCGA      TRBV29..TRBJ2-1.TRBC2
6 Sample_001_AACTTTCAGGGAAACA      TRBV29..TRBJ2-1.TRBC2
  • Get combined V(D)J+C annotations for TRA chains.
tra_data <- subset(sc_vdj_contigs, chain=="TRA")
vdj_data <- data.frame(barcode=tra_data$barcode, 
        vdj_annotation=paste(tra_data$v_gene, tra_data$d_gene, 
            tra_data$j_gene, tra_data$c_gene, sep="."))
head(vdj_data)
                      barcode        vdj_annotation
1 Sample_001_AAACCTGAGCTGTCTA    TRAV1..TRAJ37.TRAC
2 Sample_001_AACGTTGAGTAGCCGA TRAV6N-7..TRAJ44.TRAC
3 Sample_001_AACTTTCAGGGAAACA TRAV6N-7..TRAJ44.TRAC
4 Sample_001_AACTTTCAGTTACCCA   TRAV6-6..TRAJ7.TRAC
5 Sample_001_AACTTTCCATTCTTAC  TRAV9-1..TRAJ18.TRAC
6 Sample_001_AAGGAGCGTAAACGCG    TRAV1..TRAJ16.TRAC
  1. Merge the cluster and sample information with V(D)J data. You could use any of the above subsets of the V(D)J data for this analysis. For this example we will use the previous example of V(D)J+C annotations for all chains.
vdj_data <- merge(cluster_ids, vdj_data, by.x=0, by.y=1)
head(vdj_data)
                    Row.names cluster     sample             vdj_annotation
1 Sample_001_AAACCTGAGCTGTCTA      14 Sample_001    TRBV13-1..TRBJ2-2.TRBC2
2 Sample_001_AAACCTGAGCTGTCTA      14 Sample_001         TRAV1..TRAJ37.TRAC
3 Sample_001_AAACCTGGTGAGGGTT       9 Sample_001 TRBV19.TRBD1.TRBJ2-7.TRBC2
4 Sample_001_AACGTTGAGTAGCCGA       4 Sample_001      TRAV6N-7..TRAJ44.TRAC
5 Sample_001_AACGTTGAGTAGCCGA       4 Sample_001      TRBV29..TRBJ2-1.TRBC2
6 Sample_001_AACTTTCAGGGAAACA       4 Sample_001      TRAV6N-7..TRAJ44.TRAC
  1. Compute counts for each annotation and sample. For this example, we are using the V(D)J+C annotations data.
library(dplyr)

vdj_counts <- vdj_data %>% group_by(sample, vdj_annotation) %>% summarize(counts=n())
`summarise()` has grouped output by 'sample'. You can override using the
`.groups` argument.
head(vdj_counts)
# A tibble: 6 × 3
# Groups:   sample [1]
  sample     vdj_annotation       counts
  <chr>      <chr>                 <int>
1 Sample_001 TRAV1..TRAJ16.TRAC        3
2 Sample_001 TRAV1..TRAJ37.TRAC        1
3 Sample_001 TRAV10D..TRAJ26.TRAC      1
4 Sample_001 TRAV10D..TRAJ37.TRAC      1
5 Sample_001 TRAV10D..TRAJ45.TRAC      1
6 Sample_001 TRAV10N..TRAJ7.TRAC       1
  1. Convert counts data.frame to a matrix with samples on the rows and V(D)J+C annotations on columns.
library(tidyr)
vdj_counts_mat <- pivot_wider(vdj_counts, names_from="vdj_annotation", values_from="counts", values_fill=0)

head(vdj_counts_mat)[,1:4]
# A tibble: 6 × 4
# Groups:   sample [6]
  sample     TRAV1..TRAJ16.TRAC TRAV1..TRAJ37.TRAC TRAV10D..TRAJ26.TRAC
  <chr>                   <int>              <int>                <int>
1 Sample_001                  3                  1                    1
2 Sample_002                  0                  1                    0
3 Sample_003                  0                  0                    0
4 Sample_004                  0                  0                    1
5 Sample_005                  0                  0                    0
6 Sample_006                  0                  0                    0
# The first column has the sample IDs.
# So, set as the row names and drop the first column when converting to matrix.
# Convert to a data frame first so we can add rownames
vdj_counts_mat <- as.data.frame(vdj_counts_mat)
row.names(vdj_counts_mat) <- vdj_counts_mat[,1]
vdj_counts_mat <- as.matrix(vdj_counts_mat[,-1, drop=F])

head(vdj_counts_mat)[,1:3]
           TRAV1..TRAJ16.TRAC TRAV1..TRAJ37.TRAC TRAV10D..TRAJ26.TRAC
Sample_001                  3                  1                    1
Sample_002                  0                  1                    0
Sample_003                  0                  0                    0
Sample_004                  0                  0                    1
Sample_005                  0                  0                    0
Sample_006                  0                  0                    0
  1. Compute the diversity indices for each sample. Will compute Shannon entropy (S) and richness, a.k.a. number of unique entities (N).
library(vegan)
Loading required package: permute
Loading required package: lattice
This is vegan 2.6-6.1

Attaching package: 'vegan'
The following object is masked from 'package:VGAM':

    calibrate
vdj_diversity <- data.frame(sample=row.names(vdj_counts_mat), 
            S=diversity(vdj_counts_mat, index="shannon"),
            N=specnumber(vdj_counts_mat))

head(vdj_diversity)
               sample        S   N
Sample_001 Sample_001 4.640368 156
Sample_002 Sample_002 4.087640  90
Sample_003 Sample_003 5.001865 220
Sample_004 Sample_004 4.956640 292
Sample_005 Sample_005 5.119334 299
Sample_006 Sample_006 4.551837 199

Once the diversity indices are generated these can be compared with the experimental factors for your biological samples using basic statistical tests, e.g. Kruskal-Wallis or Mann-Whitney. The diversity indices could also be plotted using dot/box plots as a function of particular experimental factors.

3.9.1 Comparing and visualization of diversity indices.

For these example we used the following mapping information

mapping <- read.delim("https://wd.cri.uic.edu/scrna/mapping.txt")
head(mapping)
      Sample Group
1 Sample_001     A
2 Sample_002     B
3 Sample_003     D
4 Sample_004     C
5 Sample_005     C
6 Sample_006     C
  • Merge the mapping with the compute diversity indices.
vdj_diversity <- merge(vdj_diversity, mapping, by.x=1, by.y=1)

# Take a peek at the results
head(vdj_diversity)
      sample        S   N Group
1 Sample_001 4.640368 156     A
2 Sample_002 4.087640  90     B
3 Sample_003 5.001865 220     D
4 Sample_004 4.956640 292     C
5 Sample_005 5.119334 299     C
6 Sample_006 4.551837 199     C
  • Perform statistical comparison using Kruskal-Wallis test
# Comparing difference in computed Shannon entropy (S) with group
kruskal.test(S ~ Group, vdj_diversity)

    Kruskal-Wallis rank sum test

data:  S by Group
Kruskal-Wallis chi-squared = 7.2051, df = 3, p-value = 0.06564
# Comparing difference in richness (N) with group
kruskal.test(N ~ Group, vdj_diversity)

    Kruskal-Wallis rank sum test

data:  N by Group
Kruskal-Wallis chi-squared = 7.2051, df = 3, p-value = 0.06564
  • Visualization of diversity indices
boxplot(S ~ Group, vdj_diversity)

boxplot(N ~ Group, vdj_diversity)

  • Using ggplot2
library(ggplot2)

ggplot(vdj_diversity, aes(x=Group, y=S)) + geom_boxplot()

ggplot(vdj_diversity, aes(x=Group, y=N)) + geom_boxplot()

NOTE: It may be important to subsample/rarefy the counts data to account for any diversity indices that may be due to the fact that some samples had higher cell counts than others. The following is an example of subsample the data to 500 counts per sample.

vdj_counts_200 <- rrarefy(vdj_counts_mat, 200)
Warning in rrarefy(vdj_counts_mat, 200): some row sums < 'sample' and are not
rarefied
# Filter out any samples that were not subsampled to a depth of 200 (have less than 200 counts)
vdj_counts_200 <- vdj_counts_200[ rowSums(vdj_counts_200) == 200, ]

# Compute the diversity indices for each sample
# Will compute Shannon entropy (S) and richness, a.k.a. number of unique entities (N)
vdj_diversity <- data.frame(sample=row.names(vdj_counts_200), 
            S=diversity(vdj_counts_200, index="shannon"),
            N=specnumber(vdj_counts_200))

head(vdj_diversity)
               sample        S   N
Sample_001 Sample_001 4.396868 105
Sample_003 Sample_003 4.581617 127
Sample_004 Sample_004 4.318617 124
Sample_005 Sample_005 4.501748 119
Sample_006 Sample_006 4.188451 105
Sample_008 Sample_008 4.262527 112